* [PATCH] mm: fix the inaccurate memory statistics issue for users
@ 2025-05-24  1:59 Baolin Wang
  2025-05-30  3:53 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-05-24  1:59 UTC (permalink / raw)
  To: akpm, david, shakeel.butt
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	donettom, aboorvad, sj, baolin.wang, linux-mm, linux-fsdevel,
	linux-kernel

On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field reported by the top command is always
0 for some processes, which causes a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite large
on these machines, so a significant amount of RSS can remain cached in the
percpu counters, ever since commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") converted mm's rss stats into percpu_counter.
Intuitively, the batch size should be tuned down, but on some paths
performance may take precedence over statistical accuracy. Therefore,
introduce a new interface that sums up the percpu counts, and use it when
reporting the mm counters to userspace, which removes the confusion. In
addition, these readers are not on a performance-critical path, so the
modification should be acceptable.
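
As a rough illustration of the scale (the CPU count below is hypothetical,
and the batch value assumes the usual percpu_counter batch scaling of
roughly max(32, 2 * nr_online_cpus)):

    nr_cpus                    = 256
    percpu_counter_batch       = max(32, 2 * 256) = 512 pages
    cached delta per CPU       < 512 * 64K = 32M

A task whose per-CPU cached deltas never reach that threshold never folds
anything into the global count that get_mm_counter() reads via
percpu_counter_read_positive(), so top keeps showing RES = 0 for it.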

Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from RFC:
 - Collect reviewed and tested tags. Thanks.
 - Add Fixes tag.
---
 fs/proc/task_mmu.c | 14 +++++++-------
 include/linux/mm.h |  5 +++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	unsigned long text, lib, swap, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
-	anon = get_mm_counter(mm, MM_ANONPAGES);
-	file = get_mm_counter(mm, MM_FILEPAGES);
-	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+	file = get_mm_counter_sum(mm, MM_FILEPAGES);
+	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	text = min(text, mm->exec_vm << PAGE_SHIFT);
 	lib = (mm->exec_vm << PAGE_SHIFT) - text;
 
-	swap = get_mm_counter(mm, MM_SWAPENTS);
+	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
-	*shared = get_mm_counter(mm, MM_FILEPAGES) +
-			get_mm_counter(mm, MM_SHMEMPAGES);
+	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+			get_mm_counter_sum(mm, MM_SHMEMPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
-	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 	return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
-- 
2.43.5




* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-24  1:59 [PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
@ 2025-05-30  3:53 ` Andrew Morton
  2025-05-30 13:39   ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2025-05-30  3:53 UTC (permalink / raw)
  To: Baolin Wang
  Cc: david, shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
	linux-kernel

On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> On some large machines with a high number of CPUs running a 64K pagesize
> kernel, we found that the 'RES' field is always 0 displayed by the top
> command for some processes, which will cause a lot of confusion for users.
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> 
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
> 
> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")

Three years ago.

> Tested-by Donet Tom <donettom@linux.ibm.com>
> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: SeongJae Park <sj@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Thanks, I added cc:stable to this.




* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-30  3:53 ` Andrew Morton
@ 2025-05-30 13:39   ` Michal Hocko
  2025-05-30 23:00     ` Andrew Morton
  2025-06-03  8:08     ` Baolin Wang
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2025-05-30 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Baolin Wang, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
> > On some large machines with a high number of CPUs running a 64K pagesize
> > kernel, we found that the 'RES' field is always 0 displayed by the top
> > command for some processes, which will cause a lot of confusion for users.
> > 
> >     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
> >       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> > 
> > The main reason is that the batch size of the percpu counter is quite large
> > on these machines, caching a significant percpu value, since converting mm's
> > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > but on some paths, performance may take precedence over statistical accuracy.
> > Therefore, introducing a new interface to add the percpu statistical count
> > and display it to users, which can remove the confusion. In addition, this
> > change is not expected to be on a performance-critical path, so the modification
> > should be acceptable.
> > 
> > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> 
> Three years ago.
> 
> > Tested-by Donet Tom <donettom@linux.ibm.com>
> > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: SeongJae Park <sj@kernel.org>
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Thanks, I added cc:stable to this.

I have only noticed this new posting now. I do not think this is stable
material. I am also not convinced that the impact of exposing the pcp lock
to userspace has been properly analyzed and documented in the changelog.
I am not nacking the patch (yet) but I would like to see a serious analysis
showing that this has been properly thought through.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-30 13:39   ` Michal Hocko
@ 2025-05-30 23:00     ` Andrew Morton
  2025-06-03  8:08     ` Baolin Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2025-05-30 23:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Baolin Wang, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, 30 May 2025 15:39:36 +0200 Michal Hocko <mhocko@suse.com> wrote:

> > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> > 
> > Three years ago.
> > 
> > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Acked-by: SeongJae Park <sj@kernel.org>
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > 
> > Thanks, I added cc:stable to this.
> 
> I have only noticed this new posting now. I do not think this is a
> stable material. I am also not convinced that the impact of the pcp lock
> exposure to the userspace has been properly analyzed and documented in
> the changelog. I am not nacking the patch (yet) but I would like to see
> a serious analyses that this has been properly thought through.

Thanks.  I'll move this into the mm-new branch while we work through
these things.



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-05-30 13:39   ` Michal Hocko
  2025-05-30 23:00     ` Andrew Morton
@ 2025-06-03  8:08     ` Baolin Wang
  2025-06-03  8:15       ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03  8:08 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: david, shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
	linux-kernel



On 2025/5/30 21:39, Michal Hocko wrote:
> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>
>>> On some large machines with a high number of CPUs running a 64K pagesize
>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>> command for some processes, which will cause a lot of confusion for users.
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>>        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>>
>>> The main reason is that the batch size of the percpu counter is quite large
>>> on these machines, caching a significant percpu value, since converting mm's
>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>> but on some paths, performance may take precedence over statistical accuracy.
>>> Therefore, introducing a new interface to add the percpu statistical count
>>> and display it to users, which can remove the confusion. In addition, this
>>> change is not expected to be on a performance-critical path, so the modification
>>> should be acceptable.
>>>
>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>
>> Three years ago.
>>
>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> Acked-by: SeongJae Park <sj@kernel.org>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> Thanks, I added cc:stable to this.
> 
> I have only noticed this new posting now. I do not think this is a
> stable material. I am also not convinced that the impact of the pcp lock
> exposure to the userspace has been properly analyzed and documented in
> the changelog. I am not nacking the patch (yet) but I would like to see
> a serious analyses that this has been properly thought through.

Good point. I did a quick measurement on my 32-core Arm machine. I ran
two workloads: one was the 'top' command, top -d 1 (updating every
second); the other was kernel building (time make -j32).

 From the following data, I did not see any significant impact of the 
patch changes on the execution of the kernel building workload.

w/o patch:
real	4m33.887s
user	118m24.153s
sys	9m51.402s

w/ patch:
real	4m34.495s
user	118m21.739s
sys	9m39.232s



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03  8:08     ` Baolin Wang
@ 2025-06-03  8:15       ` Michal Hocko
  2025-06-03  8:32         ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03  8:15 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Tue 03-06-25 16:08:21, Baolin Wang wrote:
> 
> 
> On 2025/5/30 21:39, Michal Hocko wrote:
> > On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> > > On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > > 
> > > > On some large machines with a high number of CPUs running a 64K pagesize
> > > > kernel, we found that the 'RES' field is always 0 displayed by the top
> > > > command for some processes, which will cause a lot of confusion for users.
> > > > 
> > > >      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > > >   875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
> > > >        1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> > > > 
> > > > The main reason is that the batch size of the percpu counter is quite large
> > > > on these machines, caching a significant percpu value, since converting mm's
> > > > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > > > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > > > but on some paths, performance may take precedence over statistical accuracy.
> > > > Therefore, introducing a new interface to add the percpu statistical count
> > > > and display it to users, which can remove the confusion. In addition, this
> > > > change is not expected to be on a performance-critical path, so the modification
> > > > should be acceptable.
> > > > 
> > > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> > > 
> > > Three years ago.
> > > 
> > > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > Acked-by: SeongJae Park <sj@kernel.org>
> > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > 
> > > Thanks, I added cc:stable to this.
> > 
> > I have only noticed this new posting now. I do not think this is a
> > stable material. I am also not convinced that the impact of the pcp lock
> > exposure to the userspace has been properly analyzed and documented in
> > the changelog. I am not nacking the patch (yet) but I would like to see
> > a serious analyses that this has been properly thought through.
> 
> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
> workloads, one is the 'top' command: top -d 1 (updating every second).
> Another workload is kernel building (time make -j32).
> 
> From the following data, I did not see any significant impact of the patch
> changes on the execution of the kernel building workload.

I do not think this is really representative of an adverse workload. I
believe you need to have a look at which potentially sensitive kernel code
paths run with the lock held, and at how a busy loop over the affected proc
files would influence those in the worst case. Maybe there are no such
kernel code paths to really worry about. This should be a part of the
changelog though.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03  8:15       ` Michal Hocko
@ 2025-06-03  8:32         ` Baolin Wang
  2025-06-03 10:28           ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03  8:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel



On 2025/6/3 16:15, Michal Hocko wrote:
> On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>>
>>
>> On 2025/5/30 21:39, Michal Hocko wrote:
>>> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>>>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>> On some large machines with a high number of CPUs running a 64K pagesize
>>>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>>>> command for some processes, which will cause a lot of confusion for users.
>>>>>
>>>>>       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>>    875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>>>>         1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>>>>
>>>>> The main reason is that the batch size of the percpu counter is quite large
>>>>> on these machines, caching a significant percpu value, since converting mm's
>>>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>>>> but on some paths, performance may take precedence over statistical accuracy.
>>>>> Therefore, introducing a new interface to add the percpu statistical count
>>>>> and display it to users, which can remove the confusion. In addition, this
>>>>> change is not expected to be on a performance-critical path, so the modification
>>>>> should be acceptable.
>>>>>
>>>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>>>
>>>> Three years ago.
>>>>
>>>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> Acked-by: SeongJae Park <sj@kernel.org>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>
>>>> Thanks, I added cc:stable to this.
>>>
>>> I have only noticed this new posting now. I do not think this is a
>>> stable material. I am also not convinced that the impact of the pcp lock
>>> exposure to the userspace has been properly analyzed and documented in
>>> the changelog. I am not nacking the patch (yet) but I would like to see
>>> a serious analyses that this has been properly thought through.
>>
>> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
>> workloads, one is the 'top' command: top -d 1 (updating every second).
>> Another workload is kernel building (time make -j32).
>>
>>  From the following data, I did not see any significant impact of the patch
>> changes on the execution of the kernel building workload.
> 
> I do not think this is really representative of an adverse workload. I
> believe you need to have a look which potentially sensitive kernel code
> paths run with the lock held how would a busy loop over affected proc
> files influence those in the worst case. Maybe there are none of such
> kernel code paths to really worry about. This should be a part of the
> changelog though.

IMO, kernel code paths usually have batch caching to avoid lock 
contention, so I think the impact on kernel code paths is not that 
obvious. Therefore, I also think it's hard to find an adverse workload.

How about adding the following comments in the commit log?
"
I did a quick measurement on my 32-core Arm machine. I ran two
workloads: one was the 'top' command, top -d 1 (updating every second);
the other was kernel building (time make -j32).

 From the following data, I did not see any significant impact of the 
patch changes on the execution of the kernel building workload. In 
addition, kernel code paths usually have batch caching to avoid pcp lock 
contention, so I think the impact on kernel code paths is not that 
obvious even when the pcp lock is exposed to userspace.

w/o patch:
real    4m33.887s
user    118m24.153s
sys    9m51.402s

w/ patch:
real    4m34.495s
user    118m21.739s
sys    9m39.232s
"



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03  8:32         ` Baolin Wang
@ 2025-06-03 10:28           ` Michal Hocko
  2025-06-03 14:22             ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03 10:28 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Tue 03-06-25 16:32:35, Baolin Wang wrote:
> 
> 
> On 2025/6/3 16:15, Michal Hocko wrote:
> > On Tue 03-06-25 16:08:21, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/5/30 21:39, Michal Hocko wrote:
> > > > On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> > > > > On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > > > > 
> > > > > > On some large machines with a high number of CPUs running a 64K pagesize
> > > > > > kernel, we found that the 'RES' field is always 0 displayed by the top
> > > > > > command for some processes, which will cause a lot of confusion for users.
> > > > > > 
> > > > > >       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > > > > >    875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
> > > > > >         1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
> > > > > > 
> > > > > > The main reason is that the batch size of the percpu counter is quite large
> > > > > > on these machines, caching a significant percpu value, since converting mm's
> > > > > > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > > > > > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > > > > > but on some paths, performance may take precedence over statistical accuracy.
> > > > > > Therefore, introducing a new interface to add the percpu statistical count
> > > > > > and display it to users, which can remove the confusion. In addition, this
> > > > > > change is not expected to be on a performance-critical path, so the modification
> > > > > > should be acceptable.
> > > > > > 
> > > > > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> > > > > 
> > > > > Three years ago.
> > > > > 
> > > > > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > > > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > > > Acked-by: SeongJae Park <sj@kernel.org>
> > > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > 
> > > > > Thanks, I added cc:stable to this.
> > > > 
> > > > I have only noticed this new posting now. I do not think this is a
> > > > stable material. I am also not convinced that the impact of the pcp lock
> > > > exposure to the userspace has been properly analyzed and documented in
> > > > the changelog. I am not nacking the patch (yet) but I would like to see
> > > > a serious analyses that this has been properly thought through.
> > > 
> > > Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
> > > workloads, one is the 'top' command: top -d 1 (updating every second).
> > > Another workload is kernel building (time make -j32).
> > > 
> > >  From the following data, I did not see any significant impact of the patch
> > > changes on the execution of the kernel building workload.
> > 
> > I do not think this is really representative of an adverse workload. I
> > believe you need to have a look which potentially sensitive kernel code
> > paths run with the lock held how would a busy loop over affected proc
> > files influence those in the worst case. Maybe there are none of such
> > kernel code paths to really worry about. This should be a part of the
> > changelog though.
> 
> IMO, kernel code paths usually have batch caching to avoid lock contention,
> so I think the impact on kernel code paths is not that obvious.

This is a very generic statement. Does this refer to the existing pcp
locking usage in the kernel? Have you evaluated existing users?

> Therefore, I
> also think it's hard to find an adverse workload.
> 
> How about adding the following comments in the commit log?
> "
> I did a quick measurement on my 32 cores Arm machine. I ran two workloads,
> one is the 'top' command: top -d 1 (updating every second). Another workload
> is kernel building (time make -j32).

This test doesn't really do much to trigger an actual lock contention as
already mentioned.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03 10:28           ` Michal Hocko
@ 2025-06-03 14:22             ` Baolin Wang
  2025-06-03 14:48               ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel



On 2025/6/3 18:28, Michal Hocko wrote:
> On Tue 03-06-25 16:32:35, Baolin Wang wrote:
>>
>>
>> On 2025/6/3 16:15, Michal Hocko wrote:
>>> On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/5/30 21:39, Michal Hocko wrote:
>>>>> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>>>>>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>> On some large machines with a high number of CPUs running a 64K pagesize
>>>>>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>>>>>> command for some processes, which will cause a lot of confusion for users.
>>>>>>>
>>>>>>>        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>>>>     875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>>>>>>          1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>>>>>>
>>>>>>> The main reason is that the batch size of the percpu counter is quite large
>>>>>>> on these machines, caching a significant percpu value, since converting mm's
>>>>>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>>>>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>>>>>> but on some paths, performance may take precedence over statistical accuracy.
>>>>>>> Therefore, introducing a new interface to add the percpu statistical count
>>>>>>> and display it to users, which can remove the confusion. In addition, this
>>>>>>> change is not expected to be on a performance-critical path, so the modification
>>>>>>> should be acceptable.
>>>>>>>
>>>>>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>>>>>
>>>>>> Three years ago.
>>>>>>
>>>>>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>>>>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>> Acked-by: SeongJae Park <sj@kernel.org>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>
>>>>>> Thanks, I added cc:stable to this.
>>>>>
>>>>> I have only noticed this new posting now. I do not think this is a
>>>>> stable material. I am also not convinced that the impact of the pcp lock
>>>>> exposure to the userspace has been properly analyzed and documented in
>>>>> the changelog. I am not nacking the patch (yet) but I would like to see
>>>>> a serious analyses that this has been properly thought through.
>>>>
>>>> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
>>>> workloads, one is the 'top' command: top -d 1 (updating every second).
>>>> Another workload is kernel building (time make -j32).
>>>>
>>>>   From the following data, I did not see any significant impact of the patch
>>>> changes on the execution of the kernel building workload.
>>>
>>> I do not think this is really representative of an adverse workload. I
>>> believe you need to have a look which potentially sensitive kernel code
>>> paths run with the lock held how would a busy loop over affected proc
>>> files influence those in the worst case. Maybe there are none of such
>>> kernel code paths to really worry about. This should be a part of the
>>> changelog though.
>>
>> IMO, kernel code paths usually have batch caching to avoid lock contention,
>> so I think the impact on kernel code paths is not that obvious.
> 
> This is a very generic statement. Does this refer to the existing pcp
> locking usage in the kernel? Have you evaluated existing users?

Let me try to clarify further.

The 'mm->rss_stat' is updated by using add_mm_counter(), 
dec/inc_mm_counter(), which are all wrappers around 
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is 
percpu batch caching to avoid 'fbc->lock' contention. This patch changes 
task_mem() and task_statm() to get the accurate mm counters under the 
'fbc->lock', but this will not exacerbate kernel 'mm->rss_stat' lock 
contention due to the percpu batch caching of the mm counters.
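
Roughly, the write-side batching looks like this (a simplified sketch of
percpu_counter_add_batch() from lib/percpu_counter.c, not the exact
upstream code):

/*
 * Simplified sketch: each CPU caches its delta locally and only folds it
 * into the shared fbc->count, under fbc->lock, once the cached delta
 * reaches 'batch'.
 */
void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	unsigned long flags;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (abs(count) >= batch) {
		raw_spin_lock_irqsave(&fbc->lock, flags);
		fbc->count += count;
		__this_cpu_sub(*fbc->counters, count - amount);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
	} else {
		/* common case: no shared lock taken at all */
		this_cpu_add(*fbc->counters, amount);
	}
	preempt_enable();
}

So a writer only takes 'fbc->lock' once per 'batch' pages per CPU, while the
new get_mm_counter_sum() (percpu_counter_sum_positive()) takes the same lock
and walks the per-CPU counters on the read side.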

You might argue that my test cases cannot demonstrate an actual lock 
contention, but they have already shown that there is no significant 
'fbc->lock' contention when the kernel updates 'mm->rss_stat'.

>> Therefore, I
>> also think it's hard to find an adverse workload.
>>
>> How about adding the following comments in the commit log?
>> "
>> I did a quick measurement on my 32 cores Arm machine. I ran two workloads,
>> one is the 'top' command: top -d 1 (updating every second). Another workload
>> is kernel building (time make -j32).
> 
> This test doesn't really do much to trigger an actual lock contention as
> already mentioned.
> 



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03 14:22             ` Baolin Wang
@ 2025-06-03 14:48               ` Michal Hocko
  2025-06-03 17:29                 ` Shakeel Butt
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03 14:48 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Tue 03-06-25 22:22:46, Baolin Wang wrote:
> Let me try to clarify further.
> 
> The 'mm->rss_stat' is updated by using add_mm_counter(),
> dec/inc_mm_counter(), which are all wrappers around
> percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
> batch caching to avoid 'fbc->lock' contention. 

OK, this is exactly the line of argument I was looking for. If _all_
updates done in the kernel are using batching and therefore the lock is
only held every N (percpu_counter_batch) updates then a risk of locking
contention would be decreased. This is worth having a note in the
changelog.

> This patch changes task_mem()
> and task_statm() to get the accurate mm counters under the 'fbc->lock', but
> this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
> the percpu batch caching of the mm counters.
> 
> You might argue that my test cases cannot demonstrate an actual lock
> contention, but they have already shown that there is no significant
> 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.

I was arguing that `top -d 1' doesn't really represent a potential
adverse usage. These proc files are generally readable, so I would
expect something like a busy-loop read while the process tries to update
the counters, to see the worst case scenario. If that is barely visible
then we can conclude a normal use wouldn't even notice.

See my point?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03 14:48               ` Michal Hocko
@ 2025-06-03 17:29                 ` Shakeel Butt
  2025-06-04 12:46                   ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-06-03 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Baolin Wang, Andrew Morton, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Jun 03, 2025 at 04:48:08PM +0200, Michal Hocko wrote:
> On Tue 03-06-25 22:22:46, Baolin Wang wrote:
> > Let me try to clarify further.
> > 
> > The 'mm->rss_stat' is updated by using add_mm_counter(),
> > dec/inc_mm_counter(), which are all wrappers around
> > percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
> > batch caching to avoid 'fbc->lock' contention. 
> 
> OK, this is exactly the line of argument I was looking for. If _all_
> updates done in the kernel are using batching and therefore the lock is
> only held every N (percpu_counter_batch) updates then a risk of locking
> contention would be decreased. This is worth having a note in the
> changelog.
> 
> > This patch changes task_mem()
> > and task_statm() to get the accurate mm counters under the 'fbc->lock', but
> > this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
> > the percpu batch caching of the mm counters.
> > 
> > You might argue that my test cases cannot demonstrate an actual lock
> > contention, but they have already shown that there is no significant
> > 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
> 
> I was arguing that `top -d 1' doesn't really represent a potential
> adverse usage. These proc files are generally readable so I would be
> expecting something like busy loop read while process tries to update
> counters to see the worst case scenario. If that is barely visible then
> we can conclude a normal use wouldn't even notice.
> 

Baolin, please run a stress-ng command that stresses minor anon page
faults in multiple threads and then run multiple bash scripts which cat
/proc/pidof(stress-ng)/status. That should show how much the stress-ng
process is impacted by the parallel status readers versus without them.



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-03 17:29                 ` Shakeel Butt
@ 2025-06-04 12:46                   ` Baolin Wang
  2025-06-04 13:46                     ` Vlastimil Babka
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-04 12:46 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
	linux-kernel



On 2025/6/4 01:29, Shakeel Butt wrote:
> On Tue, Jun 03, 2025 at 04:48:08PM +0200, Michal Hocko wrote:
>> On Tue 03-06-25 22:22:46, Baolin Wang wrote:
>>> Let me try to clarify further.
>>>
>>> The 'mm->rss_stat' is updated by using add_mm_counter(),
>>> dec/inc_mm_counter(), which are all wrappers around
>>> percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
>>> batch caching to avoid 'fbc->lock' contention.
>>
>> OK, this is exactly the line of argument I was looking for. If _all_
>> updates done in the kernel are using batching and therefore the lock is
>> only held every N (percpu_counter_batch) updates then a risk of locking
>> contention would be decreased. This is worth having a note in the
>> changelog.

OK.

>>> This patch changes task_mem()
>>> and task_statm() to get the accurate mm counters under the 'fbc->lock', but
>>> this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
>>> the percpu batch caching of the mm counters.
>>>
>>> You might argue that my test cases cannot demonstrate an actual lock
>>> contention, but they have already shown that there is no significant
>>> 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
>>
>> I was arguing that `top -d 1' doesn't really represent a potential
>> adverse usage. These proc files are generally readable so I would be
>> expecting something like busy loop read while process tries to update
>> counters to see the worst case scenario. If that is barely visible then
>> we can conclude a normal use wouldn't even notice.

OK.

> Baolin, please run stress-ng command that stresses minor anon page
> faults in multiple threads and then run multiple bash scripts which cat
> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> process is impacted by the parallel status readers versus without them.

Sure. Thanks Shakeel. I ran stress-ng with the 'stress-ng --fault 32
--perf -t 1m' command, while simultaneously running the following
script to read /proc/pidof(stress-ng)/status for each thread.

 From the following data, I did not observe any obvious impact of this 
patch on the stress-ng tests when repeatedly reading the 
/proc/pidof(stress-ng)/status.

w/o patch
stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles            59.767 B/sec
stress-ng: info:  [6891]          1,472,101,565,760 Instructions          22.033 B/sec (0.369 instr. per cycle)
stress-ng: info:  [6891]                 36,287,456 Page Faults Total      0.543 M/sec
stress-ng: info:  [6891]                 36,287,456 Page Faults Minor      0.543 M/sec

w/ patch
stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles            60.177 B/sec
stress-ng: info:  [6872]          1,484,856,150,976 Instructions          22.235 B/sec (0.369 instr. per cycle)
stress-ng: info:  [6872]                 36,547,456 Page Faults Total      0.547 M/sec
stress-ng: info:  [6872]                 36,547,456 Page Faults Minor      0.547 M/sec

=========================
#!/bin/bash

# Get the PIDs of stress-ng processes
PIDS=$(pgrep stress-ng)

# Loop through each PID and monitor /proc/[pid]/status
for PID in $PIDS; do
     while true; do
         cat /proc/$PID/status
	usleep 100000
     done &
done



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-04 12:46                   ` Baolin Wang
@ 2025-06-04 13:46                     ` Vlastimil Babka
  2025-06-04 14:16                       ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-06-04 13:46 UTC (permalink / raw)
  To: Baolin Wang, Shakeel Butt, Michal Hocko
  Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
	donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel

On 6/4/25 14:46, Baolin Wang wrote:
>> Baolin, please run stress-ng command that stresses minor anon page
>> faults in multiple threads and then run multiple bash scripts which cat
>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>> process is impacted by the parallel status readers versus without them.
> 
> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32 
> --perf -t 1m' command, while simultaneously running the following 
> scripts to read the /proc/pidof(stress-ng)/status for each thread.

How many of those scripts?

>  From the following data, I did not observe any obvious impact of this 
> patch on the stress-ng tests when repeatedly reading the 
> /proc/pidof(stress-ng)/status.
> 
> w/o patch
> stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles 
>           59.767 B/sec
> stress-ng: info:  [6891]          1,472,101,565,760 Instructions 
>           22.033 B/sec (0.369 instr. per cycle)
> stress-ng: info:  [6891]                 36,287,456 Page Faults Total 
>            0.543 M/sec
> stress-ng: info:  [6891]                 36,287,456 Page Faults Minor 
>            0.543 M/sec
> 
> w/ patch
> stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles 
>           60.177 B/sec
> stress-ng: info:  [6872]          1,484,856,150,976 Instructions 
>           22.235 B/sec (0.369 instr. per cycle)
> stress-ng: info:  [6872]                 36,547,456 Page Faults Total 
>            0.547 M/sec
> stress-ng: info:  [6872]                 36,547,456 Page Faults Minor 
>            0.547 M/sec
> 
> =========================
> #!/bin/bash
> 
> # Get the PIDs of stress-ng processes
> PIDS=$(pgrep stress-ng)
> 
> # Loop through each PID and monitor /proc/[pid]/status
> for PID in $PIDS; do
>      while true; do
>          cat /proc/$PID/status
> 	usleep 100000

Hm but this limits the reading to 10 per second? If we want to simulate an
adversary process, it should be without the sleeps I think?

>      done &
> done




* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-04 13:46                     ` Vlastimil Babka
@ 2025-06-04 14:16                       ` Baolin Wang
  2025-06-04 14:27                         ` Vlastimil Babka
  2025-06-04 16:54                         ` Shakeel Butt
  0 siblings, 2 replies; 18+ messages in thread
From: Baolin Wang @ 2025-06-04 14:16 UTC (permalink / raw)
  To: Vlastimil Babka, Shakeel Butt, Michal Hocko
  Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
	donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel



On 2025/6/4 21:46, Vlastimil Babka wrote:
> On 6/4/25 14:46, Baolin Wang wrote:
>>> Baolin, please run stress-ng command that stresses minor anon page
>>> faults in multiple threads and then run multiple bash scripts which cat
>>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>>> process is impacted by the parallel status readers versus without them.
>>
>> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
>> --perf -t 1m' command, while simultaneously running the following
>> scripts to read the /proc/pidof(stress-ng)/status for each thread.
> 
> How many of those scripts?

1 script, but it will start 32 loops, one reading each stress-ng
thread's status interface.

>>   From the following data, I did not observe any obvious impact of this
>> patch on the stress-ng tests when repeatedly reading the
>> /proc/pidof(stress-ng)/status.
>>
>> w/o patch
>> stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles
>>            59.767 B/sec
>> stress-ng: info:  [6891]          1,472,101,565,760 Instructions
>>            22.033 B/sec (0.369 instr. per cycle)
>> stress-ng: info:  [6891]                 36,287,456 Page Faults Total
>>             0.543 M/sec
>> stress-ng: info:  [6891]                 36,287,456 Page Faults Minor
>>             0.543 M/sec
>>
>> w/ patch
>> stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles
>>            60.177 B/sec
>> stress-ng: info:  [6872]          1,484,856,150,976 Instructions
>>            22.235 B/sec (0.369 instr. per cycle)
>> stress-ng: info:  [6872]                 36,547,456 Page Faults Total
>>             0.547 M/sec
>> stress-ng: info:  [6872]                 36,547,456 Page Faults Minor
>>             0.547 M/sec
>>
>> =========================
>> #!/bin/bash
>>
>> # Get the PIDs of stress-ng processes
>> PIDS=$(pgrep stress-ng)
>>
>> # Loop through each PID and monitor /proc/[pid]/status
>> for PID in $PIDS; do
>>       while true; do
>>           cat /proc/$PID/status
>> 	usleep 100000
> 
> Hm but this limits the reading to 10 per second? If we want to simulate an
> adversary process, it should be without the sleeps I think?

OK. I dropped the usleep, and I still cannot see any obvious impact.

w/o patch:
stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles            67.327 B/sec
stress-ng: info:  [6848]          1,616,524,844,832 Instructions          24.740 B/sec (0.367 instr. per cycle)
stress-ng: info:  [6848]                 39,529,792 Page Faults Total      0.605 M/sec
stress-ng: info:  [6848]                 39,529,792 Page Faults Minor      0.605 M/sec

w/patch:
stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles            68.382 B/sec
stress-ng: info:  [2485]          1,615,101,503,296 Instructions          24.750 B/sec (0.362 instr. per cycle)
stress-ng: info:  [2485]                 39,439,232 Page Faults Total      0.604 M/sec
stress-ng: info:  [2485]                 39,439,232 Page Faults Minor      0.604 M/sec



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-04 14:16                       ` Baolin Wang
@ 2025-06-04 14:27                         ` Vlastimil Babka
  2025-06-04 16:54                         ` Shakeel Butt
  1 sibling, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2025-06-04 14:27 UTC (permalink / raw)
  To: Baolin Wang, Shakeel Butt, Michal Hocko
  Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
	donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel

On 6/4/25 16:16, Baolin Wang wrote:
>>>
>>> # Get the PIDs of stress-ng processes
>>> PIDS=$(pgrep stress-ng)
>>>
>>> # Loop through each PID and monitor /proc/[pid]/status
>>> for PID in $PIDS; do
>>>       while true; do
>>>           cat /proc/$PID/status
>>> 	usleep 100000
>> 
>> Hm but this limits the reading to 10 per second? If we want to simulate an
>> adversary process, it should be without the sleeps I think?
> 
> OK. I drop the usleep, and I still can not see obvious impact.

Thanks, that's reassuring.

> w/o patch:
> stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles 
>           67.327 B/sec
> stress-ng: info:  [6848]          1,616,524,844,832 Instructions 
>           24.740 B/sec (0.367 instr. per cycle)
> stress-ng: info:  [6848]                 39,529,792 Page Faults Total 
>            0.605 M/sec
> stress-ng: info:  [6848]                 39,529,792 Page Faults Minor 
>            0.605 M/sec
> 
> w/patch:
> stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles 
>           68.382 B/sec
> stress-ng: info:  [2485]          1,615,101,503,296 Instructions 
>           24.750 B/sec (0.362 instr. per cycle)
> stress-ng: info:  [2485]                 39,439,232 Page Faults Total 
>            0.604 M/sec
> stress-ng: info:  [2485]                 39,439,232 Page Faults Minor 
>            0.604 M/sec




* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-04 14:16                       ` Baolin Wang
  2025-06-04 14:27                         ` Vlastimil Babka
@ 2025-06-04 16:54                         ` Shakeel Butt
  2025-06-05  0:48                           ` Baolin Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-06-04 16:54 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, david,
	lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
	sj, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
> 
> 
> On 2025/6/4 21:46, Vlastimil Babka wrote:
> > On 6/4/25 14:46, Baolin Wang wrote:
> > > > Baolin, please run stress-ng command that stresses minor anon page
> > > > faults in multiple threads and then run multiple bash scripts which cat
> > > > /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> > > > process is impacted by the parallel status readers versus without them.
> > > 
> > > Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
> > > --perf -t 1m' command, while simultaneously running the following
> > > scripts to read the /proc/pidof(stress-ng)/status for each thread.
> > 
> > How many of those scripts?
> 
> 1 script, but will start 32 threads to read each stress-ng thread's status
> interface.
> 
> > >   From the following data, I did not observe any obvious impact of this
> > > patch on the stress-ng tests when repeatedly reading the
> > > /proc/pidof(stress-ng)/status.
> > > 
> > > w/o patch
> > > stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles
> > >            59.767 B/sec
> > > stress-ng: info:  [6891]          1,472,101,565,760 Instructions
> > >            22.033 B/sec (0.369 instr. per cycle)
> > > stress-ng: info:  [6891]                 36,287,456 Page Faults Total
> > >             0.543 M/sec
> > > stress-ng: info:  [6891]                 36,287,456 Page Faults Minor
> > >             0.543 M/sec
> > > 
> > > w/ patch
> > > stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles
> > >            60.177 B/sec
> > > stress-ng: info:  [6872]          1,484,856,150,976 Instructions
> > >            22.235 B/sec (0.369 instr. per cycle)
> > > stress-ng: info:  [6872]                 36,547,456 Page Faults Total
> > >             0.547 M/sec
> > > stress-ng: info:  [6872]                 36,547,456 Page Faults Minor
> > >             0.547 M/sec
> > > 
> > > =========================
> > > #!/bin/bash
> > > 
> > > # Get the PIDs of stress-ng processes
> > > PIDS=$(pgrep stress-ng)
> > > 
> > > # Loop through each PID and monitor /proc/[pid]/status
> > > for PID in $PIDS; do
> > >       while true; do
> > >           cat /proc/$PID/status
> > > 	usleep 100000
> > 
> > Hm but this limits the reading to 10 per second? If we want to simulate an
> > adversary process, it should be without the sleeps I think?
> 
> OK. I drop the usleep, and I still can not see obvious impact.
> 
> w/o patch:
> stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles
> 67.327 B/sec
> stress-ng: info:  [6848]          1,616,524,844,832 Instructions
> 24.740 B/sec (0.367 instr. per cycle)
> stress-ng: info:  [6848]                 39,529,792 Page Faults Total
> 0.605 M/sec
> stress-ng: info:  [6848]                 39,529,792 Page Faults Minor
> 0.605 M/sec
> 
> w/patch:
> stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles
> 68.382 B/sec
> stress-ng: info:  [2485]          1,615,101,503,296 Instructions
> 24.750 B/sec (0.362 instr. per cycle)
> stress-ng: info:  [2485]                 39,439,232 Page Faults Total
> 0.604 M/sec
> stress-ng: info:  [2485]                 39,439,232 Page Faults Minor
> 0.604 M/sec

Is the above with 32 non-sleeping parallel reader scripts?



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-04 16:54                         ` Shakeel Butt
@ 2025-06-05  0:48                           ` Baolin Wang
  2025-06-05  6:32                             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-05  0:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, david,
	lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
	sj, linux-mm, linux-fsdevel, linux-kernel



On 2025/6/5 00:54, Shakeel Butt wrote:
> On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/6/4 21:46, Vlastimil Babka wrote:
>>> On 6/4/25 14:46, Baolin Wang wrote:
>>>>> Baolin, please run stress-ng command that stresses minor anon page
>>>>> faults in multiple threads and then run multiple bash scripts which cat
>>>>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>>>>> process is impacted by the parallel status readers versus without them.
>>>>
>>>> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
>>>> --perf -t 1m' command, while simultaneously running the following
>>>> scripts to read the /proc/pidof(stress-ng)/status for each thread.
>>>
>>> How many of those scripts?
>>
>> 1 script, but will start 32 threads to read each stress-ng thread's status
>> interface.
>>
>>>>    From the following data, I did not observe any obvious impact of this
>>>> patch on the stress-ng tests when repeatedly reading the
>>>> /proc/pidof(stress-ng)/status.
>>>>
>>>> w/o patch
>>>> stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles
>>>>             59.767 B/sec
>>>> stress-ng: info:  [6891]          1,472,101,565,760 Instructions
>>>>             22.033 B/sec (0.369 instr. per cycle)
>>>> stress-ng: info:  [6891]                 36,287,456 Page Faults Total
>>>>              0.543 M/sec
>>>> stress-ng: info:  [6891]                 36,287,456 Page Faults Minor
>>>>              0.543 M/sec
>>>>
>>>> w/ patch
>>>> stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles
>>>>             60.177 B/sec
>>>> stress-ng: info:  [6872]          1,484,856,150,976 Instructions
>>>>             22.235 B/sec (0.369 instr. per cycle)
>>>> stress-ng: info:  [6872]                 36,547,456 Page Faults Total
>>>>              0.547 M/sec
>>>> stress-ng: info:  [6872]                 36,547,456 Page Faults Minor
>>>>              0.547 M/sec
>>>>
>>>> =========================
>>>> #!/bin/bash
>>>>
>>>> # Get the PIDs of stress-ng processes
>>>> PIDS=$(pgrep stress-ng)
>>>>
>>>> # Loop through each PID and monitor /proc/[pid]/status
>>>> for PID in $PIDS; do
>>>>        while true; do
>>>>            cat /proc/$PID/status
>>>> 	usleep 100000
>>>
>>> Hm but this limits the reading to 10 per second? If we want to simulate an
>>> adversary process, it should be without the sleeps I think?
>>
>> OK. I drop the usleep, and I still can not see obvious impact.
>>
>> w/o patch:
>> stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles
>> 67.327 B/sec
>> stress-ng: info:  [6848]          1,616,524,844,832 Instructions
>> 24.740 B/sec (0.367 instr. per cycle)
>> stress-ng: info:  [6848]                 39,529,792 Page Faults Total
>> 0.605 M/sec
>> stress-ng: info:  [6848]                 39,529,792 Page Faults Minor
>> 0.605 M/sec
>>
>> w/patch:
>> stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles
>> 68.382 B/sec
>> stress-ng: info:  [2485]          1,615,101,503,296 Instructions
>> 24.750 B/sec (0.362 instr. per cycle)
>> stress-ng: info:  [2485]                 39,439,232 Page Faults Total
>> 0.604 M/sec
>> stress-ng: info:  [2485]                 39,439,232 Page Faults Minor
>> 0.604 M/sec
> 
> Is the above with 32 non-sleeping parallel reader scripts?

Yes.



* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
  2025-06-05  0:48                           ` Baolin Wang
@ 2025-06-05  6:32                             ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-06-05  6:32 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Shakeel Butt, Vlastimil Babka, Andrew Morton, david,
	lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
	sj, linux-mm, linux-fsdevel, linux-kernel

On Thu 05-06-25 08:48:07, Baolin Wang wrote:
> 
> 
> On 2025/6/5 00:54, Shakeel Butt wrote:
> > On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/6/4 21:46, Vlastimil Babka wrote:
> > > > On 6/4/25 14:46, Baolin Wang wrote:
> > > > > > Baolin, please run stress-ng command that stresses minor anon page
> > > > > > faults in multiple threads and then run multiple bash scripts which cat
> > > > > > /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> > > > > > process is impacted by the parallel status readers versus without them.
> > > > > 
> > > > > Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
> > > > > --perf -t 1m' command, while simultaneously running the following
> > > > > scripts to read the /proc/pidof(stress-ng)/status for each thread.
> > > > 
> > > > How many of those scripts?
> > > 
> > > 1 script, but will start 32 threads to read each stress-ng thread's status
> > > interface.
> > > 
> > > > >    From the following data, I did not observe any obvious impact of this
> > > > > patch on the stress-ng tests when repeatedly reading the
> > > > > /proc/pidof(stress-ng)/status.
> > > > > 
> > > > > w/o patch
> > > > > stress-ng: info:  [6891]          3,993,235,331,584 CPU Cycles
> > > > >             59.767 B/sec
> > > > > stress-ng: info:  [6891]          1,472,101,565,760 Instructions
> > > > >             22.033 B/sec (0.369 instr. per cycle)
> > > > > stress-ng: info:  [6891]                 36,287,456 Page Faults Total
> > > > >              0.543 M/sec
> > > > > stress-ng: info:  [6891]                 36,287,456 Page Faults Minor
> > > > >              0.543 M/sec
> > > > > 
> > > > > w/ patch
> > > > > stress-ng: info:  [6872]          4,018,592,975,968 CPU Cycles
> > > > >             60.177 B/sec
> > > > > stress-ng: info:  [6872]          1,484,856,150,976 Instructions
> > > > >             22.235 B/sec (0.369 instr. per cycle)
> > > > > stress-ng: info:  [6872]                 36,547,456 Page Faults Total
> > > > >              0.547 M/sec
> > > > > stress-ng: info:  [6872]                 36,547,456 Page Faults Minor
> > > > >              0.547 M/sec
> > > > > 
> > > > > =========================
> > > > > #!/bin/bash
> > > > > 
> > > > > # Get the PIDs of stress-ng processes
> > > > > PIDS=$(pgrep stress-ng)
> > > > > 
> > > > > # Loop through each PID and monitor /proc/[pid]/status
> > > > > for PID in $PIDS; do
> > > > >        while true; do
> > > > >            cat /proc/$PID/status
> > > > > 	usleep 100000
> > > > 
> > > > Hm but this limits the reading to 10 per second? If we want to simulate an
> > > > adversary process, it should be without the sleeps I think?
> > > 
> > > OK. I drop the usleep, and I still can not see obvious impact.
> > > 
> > > w/o patch:
> > > stress-ng: info:  [6848]          4,399,219,085,152 CPU Cycles
> > > 67.327 B/sec
> > > stress-ng: info:  [6848]          1,616,524,844,832 Instructions
> > > 24.740 B/sec (0.367 instr. per cycle)
> > > stress-ng: info:  [6848]                 39,529,792 Page Faults Total
> > > 0.605 M/sec
> > > stress-ng: info:  [6848]                 39,529,792 Page Faults Minor
> > > 0.605 M/sec
> > > 
> > > w/patch:
> > > stress-ng: info:  [2485]          4,462,440,381,856 CPU Cycles
> > > 68.382 B/sec
> > > stress-ng: info:  [2485]          1,615,101,503,296 Instructions
> > > 24.750 B/sec (0.362 instr. per cycle)
> > > stress-ng: info:  [2485]                 39,439,232 Page Faults Total
> > > 0.604 M/sec
> > > stress-ng: info:  [2485]                 39,439,232 Page Faults Minor
> > > 0.604 M/sec
> > 
> > Is the above with 32 non-sleeping parallel reader scripts?
> 
> Yes.

Thanks, this seems much more representative. Please update the changelog
with this. With that feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

-- 
Michal Hocko
SUSE Labs



