* [PATCH] mm: fix the inaccurate memory statistics issue for users
@ 2025-05-24 1:59 Baolin Wang
2025-05-30 3:53 ` Andrew Morton
0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-05-24 1:59 UTC (permalink / raw)
To: akpm, david, shakeel.butt
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
donettom, aboorvad, sj, baolin.wang, linux-mm, linux-fsdevel,
linux-kernel
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field reported by the top command is
always 0 for some processes, which causes a lot of confusion for users.
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite
large on these machines, so a significant amount of the RSS stays cached
in the percpu counters, ever since commit f1a7941243c1 ("mm: convert
mm's rss stats into percpu_counter") converted mm's rss stats into a
percpu_counter. Intuitively, the batch size should be tuned down, but on
some paths performance may take precedence over statistical accuracy.
Therefore, introduce a new interface that sums up the percpu counts, and
use it when displaying the values to users, which removes the confusion.
In addition, this change is not expected to be on a performance-critical
path, so the modification should be acceptable.
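
For illustration, a back-of-the-envelope sketch of how large the cached
error can become (assuming the default percpu_counter batch of
max(32, 2 * num_online_cpus()); the CPU count and page size below are
example values, not measurements):

/*
 * Illustrative only: each CPU can cache up to (batch - 1) pages that
 * have not been folded into fbc->count yet, and the /proc readers
 * currently use percpu_counter_read_positive(), which ignores the
 * per-cpu deltas entirely.
 */
#include <stdio.h>

int main(void)
{
        unsigned long long nr_cpus = 256;          /* example machine */
        unsigned long long page_size = 64 * 1024;  /* 64K pagesize kernel */
        unsigned long long batch = 2 * nr_cpus > 32 ? 2 * nr_cpus : 32;
        unsigned long long max_drift = nr_cpus * (batch - 1) * page_size;

        /* roughly 8 GiB of RSS can be invisible to such readers */
        printf("worst-case cached RSS: %llu MiB\n", max_drift >> 20);
        return 0;
}

So a small process whose entire RSS still sits in the per-cpu deltas can
show up as 0 in top, which is exactly the symptom above.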
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from RFC:
- Collect reviewed and tested tags. Thanks.
- Add Fixes tag.
---
fs/proc/task_mmu.c | 14 +++++++-------
include/linux/mm.h | 5 +++++
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
         unsigned long text, lib, swap, anon, file, shmem;
         unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
-        anon = get_mm_counter(mm, MM_ANONPAGES);
-        file = get_mm_counter(mm, MM_FILEPAGES);
-        shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+        anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+        file = get_mm_counter_sum(mm, MM_FILEPAGES);
+        shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
         /*
          * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
         text = min(text, mm->exec_vm << PAGE_SHIFT);
         lib = (mm->exec_vm << PAGE_SHIFT) - text;
-        swap = get_mm_counter(mm, MM_SWAPENTS);
+        swap = get_mm_counter_sum(mm, MM_SWAPENTS);
         SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
         SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
         SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
                          unsigned long *shared, unsigned long *text,
                          unsigned long *data, unsigned long *resident)
 {
-        *shared = get_mm_counter(mm, MM_FILEPAGES) +
-                        get_mm_counter(mm, MM_SHMEMPAGES);
+        *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+                        get_mm_counter_sum(mm, MM_SHMEMPAGES);
         *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
                                                         >> PAGE_SHIFT;
         *data = mm->data_vm + mm->stack_vm;
-        *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+        *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
         return mm->total_vm;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
         return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+        return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
--
2.43.5
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-05-24 1:59 [PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
@ 2025-05-30 3:53 ` Andrew Morton
2025-05-30 13:39 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2025-05-30 3:53 UTC (permalink / raw)
To: Baolin Wang
Cc: david, shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
linux-kernel
On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> On some large machines with a high number of CPUs running a 64K pagesize
> kernel, we found that the 'RES' field is always 0 displayed by the top
> command for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Three years ago.
> Tested-by Donet Tom <donettom@linux.ibm.com>
> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: SeongJae Park <sj@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Thanks, I added cc:stable to this.
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-05-30 3:53 ` Andrew Morton
@ 2025-05-30 13:39 ` Michal Hocko
2025-05-30 23:00 ` Andrew Morton
2025-06-03 8:08 ` Baolin Wang
0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2025-05-30 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Baolin Wang, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
> > On some large machines with a high number of CPUs running a 64K pagesize
> > kernel, we found that the 'RES' field is always 0 displayed by the top
> > command for some processes, which will cause a lot of confusion for users.
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> > 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
> >
> > The main reason is that the batch size of the percpu counter is quite large
> > on these machines, caching a significant percpu value, since converting mm's
> > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > but on some paths, performance may take precedence over statistical accuracy.
> > Therefore, introducing a new interface to add the percpu statistical count
> > and display it to users, which can remove the confusion. In addition, this
> > change is not expected to be on a performance-critical path, so the modification
> > should be acceptable.
> >
> > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>
> Three years ago.
>
> > Tested-by Donet Tom <donettom@linux.ibm.com>
> > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: SeongJae Park <sj@kernel.org>
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Thanks, I added cc:stable to this.
I have only noticed this new posting now. I do not think this is stable
material. I am also not convinced that the impact of exposing the pcp
lock to userspace has been properly analyzed and documented in the
changelog. I am not nacking the patch (yet) but I would like to see a
serious analysis showing that this has been properly thought through.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-05-30 13:39 ` Michal Hocko
@ 2025-05-30 23:00 ` Andrew Morton
2025-06-03 8:08 ` Baolin Wang
1 sibling, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2025-05-30 23:00 UTC (permalink / raw)
To: Michal Hocko
Cc: Baolin Wang, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Fri, 30 May 2025 15:39:36 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> >
> > Three years ago.
> >
> > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Acked-by: SeongJae Park <sj@kernel.org>
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > Thanks, I added cc:stable to this.
>
> I have only noticed this new posting now. I do not think this is a
> stable material. I am also not convinced that the impact of the pcp lock
> exposure to the userspace has been properly analyzed and documented in
> the changelog. I am not nacking the patch (yet) but I would like to see
> a serious analyses that this has been properly thought through.
Thanks. I'll move this into the mm-new branch while we work through
these things.
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-05-30 13:39 ` Michal Hocko
2025-05-30 23:00 ` Andrew Morton
@ 2025-06-03 8:08 ` Baolin Wang
2025-06-03 8:15 ` Michal Hocko
1 sibling, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03 8:08 UTC (permalink / raw)
To: Michal Hocko, Andrew Morton
Cc: david, shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
linux-kernel
On 2025/5/30 21:39, Michal Hocko wrote:
> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>
>>> On some large machines with a high number of CPUs running a 64K pagesize
>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>> command for some processes, which will cause a lot of confusion for users.
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>>
>>> The main reason is that the batch size of the percpu counter is quite large
>>> on these machines, caching a significant percpu value, since converting mm's
>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>> but on some paths, performance may take precedence over statistical accuracy.
>>> Therefore, introducing a new interface to add the percpu statistical count
>>> and display it to users, which can remove the confusion. In addition, this
>>> change is not expected to be on a performance-critical path, so the modification
>>> should be acceptable.
>>>
>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>
>> Three years ago.
>>
>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> Acked-by: SeongJae Park <sj@kernel.org>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>
>> Thanks, I added cc:stable to this.
>
> I have only noticed this new posting now. I do not think this is a
> stable material. I am also not convinced that the impact of the pcp lock
> exposure to the userspace has been properly analyzed and documented in
> the changelog. I am not nacking the patch (yet) but I would like to see
> a serious analyses that this has been properly thought through.
Good point. I did a quick measurement on my 32-core Arm machine. I ran
two workloads: one is the 'top' command: top -d 1 (updating every
second); the other is a kernel build (time make -j32).
From the following data, I did not see any significant impact of the
patch on the execution of the kernel build workload.
w/o patch:
real 4m33.887s
user 118m24.153s
sys 9m51.402s
w/ patch:
real 4m34.495s
user 118m21.739s
sys 9m39.232s
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 8:08 ` Baolin Wang
@ 2025-06-03 8:15 ` Michal Hocko
2025-06-03 8:32 ` Baolin Wang
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03 8:15 UTC (permalink / raw)
To: Baolin Wang
Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>
>
> On 2025/5/30 21:39, Michal Hocko wrote:
> > On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> > > On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > >
> > > > On some large machines with a high number of CPUs running a 64K pagesize
> > > > kernel, we found that the 'RES' field is always 0 displayed by the top
> > > > command for some processes, which will cause a lot of confusion for users.
> > > >
> > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> > > > 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
> > > >
> > > > The main reason is that the batch size of the percpu counter is quite large
> > > > on these machines, caching a significant percpu value, since converting mm's
> > > > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > > > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > > > but on some paths, performance may take precedence over statistical accuracy.
> > > > Therefore, introducing a new interface to add the percpu statistical count
> > > > and display it to users, which can remove the confusion. In addition, this
> > > > change is not expected to be on a performance-critical path, so the modification
> > > > should be acceptable.
> > > >
> > > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> > >
> > > Three years ago.
> > >
> > > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > Acked-by: SeongJae Park <sj@kernel.org>
> > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > >
> > > Thanks, I added cc:stable to this.
> >
> > I have only noticed this new posting now. I do not think this is a
> > stable material. I am also not convinced that the impact of the pcp lock
> > exposure to the userspace has been properly analyzed and documented in
> > the changelog. I am not nacking the patch (yet) but I would like to see
> > a serious analyses that this has been properly thought through.
>
> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
> workloads, one is the 'top' command: top -d 1 (updating every second).
> Another workload is kernel building (time make -j32).
>
> From the following data, I did not see any significant impact of the patch
> changes on the execution of the kernel building workload.
I do not think this is really representative of an adverse workload. I
believe you need to have a look at which potentially sensitive kernel
code paths run with the lock held, and at how a busy loop over the
affected proc files would influence those in the worst case. Maybe there
are no such kernel code paths to really worry about. This should be a
part of the changelog though.
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 8:15 ` Michal Hocko
@ 2025-06-03 8:32 ` Baolin Wang
2025-06-03 10:28 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03 8:32 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On 2025/6/3 16:15, Michal Hocko wrote:
> On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>>
>>
>> On 2025/5/30 21:39, Michal Hocko wrote:
>>> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>>>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>> On some large machines with a high number of CPUs running a 64K pagesize
>>>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>>>> command for some processes, which will cause a lot of confusion for users.
>>>>>
>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>>>>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>>>>
>>>>> The main reason is that the batch size of the percpu counter is quite large
>>>>> on these machines, caching a significant percpu value, since converting mm's
>>>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>>>> but on some paths, performance may take precedence over statistical accuracy.
>>>>> Therefore, introducing a new interface to add the percpu statistical count
>>>>> and display it to users, which can remove the confusion. In addition, this
>>>>> change is not expected to be on a performance-critical path, so the modification
>>>>> should be acceptable.
>>>>>
>>>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>>>
>>>> Three years ago.
>>>>
>>>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>> Acked-by: SeongJae Park <sj@kernel.org>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>
>>>> Thanks, I added cc:stable to this.
>>>
>>> I have only noticed this new posting now. I do not think this is a
>>> stable material. I am also not convinced that the impact of the pcp lock
>>> exposure to the userspace has been properly analyzed and documented in
>>> the changelog. I am not nacking the patch (yet) but I would like to see
>>> a serious analyses that this has been properly thought through.
>>
>> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
>> workloads, one is the 'top' command: top -d 1 (updating every second).
>> Another workload is kernel building (time make -j32).
>>
>> From the following data, I did not see any significant impact of the patch
>> changes on the execution of the kernel building workload.
>
> I do not think this is really representative of an adverse workload. I
> believe you need to have a look which potentially sensitive kernel code
> paths run with the lock held how would a busy loop over affected proc
> files influence those in the worst case. Maybe there are none of such
> kernel code paths to really worry about. This should be a part of the
> changelog though.
IMO, kernel code paths usually have batch caching to avoid lock
contention, so I think the impact on kernel code paths is not that
obvious. Therefore, I also think it's hard to find an adverse workload.
How about adding the following comment in the commit log?
"
I did a quick measurement on my 32-core Arm machine. I ran two
workloads: one is the 'top' command: top -d 1 (updating every second);
the other is a kernel build (time make -j32).
From the following data, I did not see any significant impact of the
patch on the execution of the kernel build workload. In addition, kernel
code paths usually have batch caching to avoid pcp lock contention, so I
think the impact on kernel code paths is not that obvious even if the
pcp lock is exposed to userspace.
w/o patch:
real 4m33.887s
user 118m24.153s
sys 9m51.402s
w/ patch:
real 4m34.495s
user 118m21.739s
sys 9m39.232s
"
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 8:32 ` Baolin Wang
@ 2025-06-03 10:28 ` Michal Hocko
2025-06-03 14:22 ` Baolin Wang
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03 10:28 UTC (permalink / raw)
To: Baolin Wang
Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Tue 03-06-25 16:32:35, Baolin Wang wrote:
>
>
> On 2025/6/3 16:15, Michal Hocko wrote:
> > On Tue 03-06-25 16:08:21, Baolin Wang wrote:
> > >
> > >
> > > On 2025/5/30 21:39, Michal Hocko wrote:
> > > > On Thu 29-05-25 20:53:13, Andrew Morton wrote:
> > > > > On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > > > >
> > > > > > On some large machines with a high number of CPUs running a 64K pagesize
> > > > > > kernel, we found that the 'RES' field is always 0 displayed by the top
> > > > > > command for some processes, which will cause a lot of confusion for users.
> > > > > >
> > > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > > > > 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> > > > > > 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
> > > > > >
> > > > > > The main reason is that the batch size of the percpu counter is quite large
> > > > > > on these machines, caching a significant percpu value, since converting mm's
> > > > > > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > > > > > stats into percpu_counter"). Intuitively, the batch number should be optimized,
> > > > > > but on some paths, performance may take precedence over statistical accuracy.
> > > > > > Therefore, introducing a new interface to add the percpu statistical count
> > > > > > and display it to users, which can remove the confusion. In addition, this
> > > > > > change is not expected to be on a performance-critical path, so the modification
> > > > > > should be acceptable.
> > > > > >
> > > > > > Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
> > > > >
> > > > > Three years ago.
> > > > >
> > > > > > Tested-by Donet Tom <donettom@linux.ibm.com>
> > > > > > Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > > > Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> > > > > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > > > Acked-by: SeongJae Park <sj@kernel.org>
> > > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > >
> > > > > Thanks, I added cc:stable to this.
> > > >
> > > > I have only noticed this new posting now. I do not think this is a
> > > > stable material. I am also not convinced that the impact of the pcp lock
> > > > exposure to the userspace has been properly analyzed and documented in
> > > > the changelog. I am not nacking the patch (yet) but I would like to see
> > > > a serious analyses that this has been properly thought through.
> > >
> > > Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
> > > workloads, one is the 'top' command: top -d 1 (updating every second).
> > > Another workload is kernel building (time make -j32).
> > >
> > > From the following data, I did not see any significant impact of the patch
> > > changes on the execution of the kernel building workload.
> >
> > I do not think this is really representative of an adverse workload. I
> > believe you need to have a look which potentially sensitive kernel code
> > paths run with the lock held how would a busy loop over affected proc
> > files influence those in the worst case. Maybe there are none of such
> > kernel code paths to really worry about. This should be a part of the
> > changelog though.
>
> IMO, kernel code paths usually have batch caching to avoid lock contention,
> so I think the impact on kernel code paths is not that obvious.
This is a very generic statement. Does this refer to the existing pcp
locking usage in the kernel? Have you evaluated existing users?
> Therefore, I
> also think it's hard to find an adverse workload.
>
> How about adding the following comments in the commit log?
> "
> I did a quick measurement on my 32 cores Arm machine. I ran two workloads,
> one is the 'top' command: top -d 1 (updating every second). Another workload
> is kernel building (time make -j32).
This test doesn't really do much to trigger actual lock contention, as
already mentioned.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 10:28 ` Michal Hocko
@ 2025-06-03 14:22 ` Baolin Wang
2025-06-03 14:48 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-03 14:22 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On 2025/6/3 18:28, Michal Hocko wrote:
> On Tue 03-06-25 16:32:35, Baolin Wang wrote:
>>
>>
>> On 2025/6/3 16:15, Michal Hocko wrote:
>>> On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/5/30 21:39, Michal Hocko wrote:
>>>>> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>>>>>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>> On some large machines with a high number of CPUs running a 64K pagesize
>>>>>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>>>>>> command for some processes, which will cause a lot of confusion for users.
>>>>>>>
>>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>>>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>>>>>>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>>>>>>
>>>>>>> The main reason is that the batch size of the percpu counter is quite large
>>>>>>> on these machines, caching a significant percpu value, since converting mm's
>>>>>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>>>>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>>>>>> but on some paths, performance may take precedence over statistical accuracy.
>>>>>>> Therefore, introducing a new interface to add the percpu statistical count
>>>>>>> and display it to users, which can remove the confusion. In addition, this
>>>>>>> change is not expected to be on a performance-critical path, so the modification
>>>>>>> should be acceptable.
>>>>>>>
>>>>>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>>>>>
>>>>>> Three years ago.
>>>>>>
>>>>>>> Tested-by Donet Tom <donettom@linux.ibm.com>
>>>>>>> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>>>> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>>>>>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>>>>>> Acked-by: SeongJae Park <sj@kernel.org>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>
>>>>>> Thanks, I added cc:stable to this.
>>>>>
>>>>> I have only noticed this new posting now. I do not think this is a
>>>>> stable material. I am also not convinced that the impact of the pcp lock
>>>>> exposure to the userspace has been properly analyzed and documented in
>>>>> the changelog. I am not nacking the patch (yet) but I would like to see
>>>>> a serious analyses that this has been properly thought through.
>>>>
>>>> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
>>>> workloads, one is the 'top' command: top -d 1 (updating every second).
>>>> Another workload is kernel building (time make -j32).
>>>>
>>>> From the following data, I did not see any significant impact of the patch
>>>> changes on the execution of the kernel building workload.
>>>
>>> I do not think this is really representative of an adverse workload. I
>>> believe you need to have a look which potentially sensitive kernel code
>>> paths run with the lock held how would a busy loop over affected proc
>>> files influence those in the worst case. Maybe there are none of such
>>> kernel code paths to really worry about. This should be a part of the
>>> changelog though.
>>
>> IMO, kernel code paths usually have batch caching to avoid lock contention,
>> so I think the impact on kernel code paths is not that obvious.
>
> This is a very generic statement. Does this refer to the existing pcp
> locking usage in the kernel? Have you evaluated existing users?
Let me try to clarify further.
The 'mm->rss_stat' counters are updated using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this will not exacerbate kernel 'mm->rss_stat' lock
contention, due to the percpu batch caching of the mm counters.
You might argue that my test cases cannot demonstrate actual lock
contention, but they have already shown that there is no significant
'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
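
For reference, a simplified sketch of the two paths involved (not the
exact lib/percpu_counter.c code; preemption/IRQ handling is omitted),
just to make the locking pattern explicit:

/*
 * Writer side (page fault and friends): fbc->lock is only taken once
 * the per-cpu delta reaches 'batch', i.e. roughly one lock acquisition
 * per 'batch' updates on each CPU.
 */
void percpu_counter_add_batch_sketch(struct percpu_counter *fbc,
                                     s64 amount, s32 batch)
{
        s64 count = this_cpu_read(*fbc->counters) + amount;

        if (abs(count) >= batch) {
                raw_spin_lock(&fbc->lock);
                fbc->count += count;
                this_cpu_write(*fbc->counters, 0);
                raw_spin_unlock(&fbc->lock);
        } else {
                this_cpu_write(*fbc->counters, count);
        }
}

/*
 * Reader side (task_mem()/task_statm() with this patch): folds the
 * per-cpu deltas in under fbc->lock, so each /proc read holds the lock
 * for one pass over the online CPUs.
 */
s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
{
        s64 ret;
        int cpu;

        raw_spin_lock(&fbc->lock);
        ret = fbc->count;
        for_each_online_cpu(cpu)
                ret += *per_cpu_ptr(fbc->counters, cpu);
        raw_spin_unlock(&fbc->lock);
        return ret;
}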
>> Therefore, I
>> also think it's hard to find an adverse workload.
>>
>> How about adding the following comments in the commit log?
>> "
>> I did a quick measurement on my 32 cores Arm machine. I ran two workloads,
>> one is the 'top' command: top -d 1 (updating every second). Another workload
>> is kernel building (time make -j32).
>
> This test doesn't really do much to trigger an actual lock contention as
> already mentioned.
>
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 14:22 ` Baolin Wang
@ 2025-06-03 14:48 ` Michal Hocko
2025-06-03 17:29 ` Shakeel Butt
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2025-06-03 14:48 UTC (permalink / raw)
To: Baolin Wang
Cc: Andrew Morton, david, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Tue 03-06-25 22:22:46, Baolin Wang wrote:
> Let me try to clarify further.
>
> The 'mm->rss_stat' is updated by using add_mm_counter(),
> dec/inc_mm_counter(), which are all wrappers around
> percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
> batch caching to avoid 'fbc->lock' contention.
OK, this is exactly the line of argument I was looking for. If _all_
updates done in the kernel use batching, and the lock is therefore only
taken every N (percpu_counter_batch) updates, then the risk of lock
contention is reduced. This is worth having a note in the changelog.
> This patch changes task_mem()
> and task_statm() to get the accurate mm counters under the 'fbc->lock', but
> this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
> the percpu batch caching of the mm counters.
>
> You might argue that my test cases cannot demonstrate an actual lock
> contention, but they have already shown that there is no significant
> 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
I was arguing that `top -d 1' doesn't really represent a potential
adverse usage. These proc files are generally readable, so I would
expect something like a busy-loop read while the process tries to update
the counters, to see the worst-case scenario. If that is barely visible
then we can conclude that normal use wouldn't even notice.
See my point?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 14:48 ` Michal Hocko
@ 2025-06-03 17:29 ` Shakeel Butt
2025-06-04 12:46 ` Baolin Wang
0 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-06-03 17:29 UTC (permalink / raw)
To: Michal Hocko
Cc: Baolin Wang, Andrew Morton, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, donettom, aboorvad, sj, linux-mm,
linux-fsdevel, linux-kernel
On Tue, Jun 03, 2025 at 04:48:08PM +0200, Michal Hocko wrote:
> On Tue 03-06-25 22:22:46, Baolin Wang wrote:
> > Let me try to clarify further.
> >
> > The 'mm->rss_stat' is updated by using add_mm_counter(),
> > dec/inc_mm_counter(), which are all wrappers around
> > percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
> > batch caching to avoid 'fbc->lock' contention.
>
> OK, this is exactly the line of argument I was looking for. If _all_
> updates done in the kernel are using batching and therefore the lock is
> only held every N (percpu_counter_batch) updates then a risk of locking
> contention would be decreased. This is worth having a note in the
> changelog.
>
> > This patch changes task_mem()
> > and task_statm() to get the accurate mm counters under the 'fbc->lock', but
> > this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
> > the percpu batch caching of the mm counters.
> >
> > You might argue that my test cases cannot demonstrate an actual lock
> > contention, but they have already shown that there is no significant
> > 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
>
> I was arguing that `top -d 1' doesn't really represent a potential
> adverse usage. These proc files are generally readable so I would be
> expecting something like busy loop read while process tries to update
> counters to see the worst case scenario. If that is barely visible then
> we can conclude a normal use wouldn't even notice.
>
Baolin, please run a stress-ng command that stresses minor anon page
faults in multiple threads, and then run multiple bash scripts which cat
/proc/pidof(stress-ng)/status. That should show how much the stress-ng
process is impacted by the parallel status readers versus without them.
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-03 17:29 ` Shakeel Butt
@ 2025-06-04 12:46 ` Baolin Wang
2025-06-04 13:46 ` Vlastimil Babka
0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-04 12:46 UTC (permalink / raw)
To: Shakeel Butt, Michal Hocko
Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, donettom, aboorvad, sj, linux-mm, linux-fsdevel,
linux-kernel
On 2025/6/4 01:29, Shakeel Butt wrote:
> On Tue, Jun 03, 2025 at 04:48:08PM +0200, Michal Hocko wrote:
>> On Tue 03-06-25 22:22:46, Baolin Wang wrote:
>>> Let me try to clarify further.
>>>
>>> The 'mm->rss_stat' is updated by using add_mm_counter(),
>>> dec/inc_mm_counter(), which are all wrappers around
>>> percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
>>> batch caching to avoid 'fbc->lock' contention.
>>
>> OK, this is exactly the line of argument I was looking for. If _all_
>> updates done in the kernel are using batching and therefore the lock is
>> only held every N (percpu_counter_batch) updates then a risk of locking
>> contention would be decreased. This is worth having a note in the
>> changelog.
OK.
>>> This patch changes task_mem()
>>> and task_statm() to get the accurate mm counters under the 'fbc->lock', but
>>> this will not exacerbate kernel 'mm->rss_stat' lock contention due to the
>>> the percpu batch caching of the mm counters.
>>>
>>> You might argue that my test cases cannot demonstrate an actual lock
>>> contention, but they have already shown that there is no significant
>>> 'fbc->lock' contention when the kernel updates 'mm->rss_stat'.
>>
>> I was arguing that `top -d 1' doesn't really represent a potential
>> adverse usage. These proc files are generally readable so I would be
>> expecting something like busy loop read while process tries to update
>> counters to see the worst case scenario. If that is barely visible then
>> we can conclude a normal use wouldn't even notice.
OK.
> Baolin, please run stress-ng command that stresses minor anon page
> faults in multiple threads and then run multiple bash scripts which cat
> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> process is impacted by the parallel status readers versus without them.
Sure. Thanks Shakeel. I ran stress-ng with the 'stress-ng --fault 32
--perf -t 1m' command, while simultaneously running the following script
to read /proc/pidof(stress-ng)/status for each thread.
From the following data, I did not observe any obvious impact of this
patch on the stress-ng tests when repeatedly reading
/proc/pidof(stress-ng)/status.
w/o patch
stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles          59.767 B/sec
stress-ng: info: [6891] 1,472,101,565,760 Instructions        22.033 B/sec (0.369 instr. per cycle)
stress-ng: info: [6891]        36,287,456 Page Faults Total    0.543 M/sec
stress-ng: info: [6891]        36,287,456 Page Faults Minor    0.543 M/sec

w/ patch
stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles          60.177 B/sec
stress-ng: info: [6872] 1,484,856,150,976 Instructions        22.235 B/sec (0.369 instr. per cycle)
stress-ng: info: [6872]        36,547,456 Page Faults Total    0.547 M/sec
stress-ng: info: [6872]        36,547,456 Page Faults Minor    0.547 M/sec
=========================
#!/bin/bash

# Get the PIDs of stress-ng processes
PIDS=$(pgrep stress-ng)

# Loop through each PID and monitor /proc/[pid]/status
for PID in $PIDS; do
        while true; do
                cat /proc/$PID/status
                usleep 100000
        done &
done
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-04 12:46 ` Baolin Wang
@ 2025-06-04 13:46 ` Vlastimil Babka
2025-06-04 14:16 ` Baolin Wang
0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-06-04 13:46 UTC (permalink / raw)
To: Baolin Wang, Shakeel Butt, Michal Hocko
Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel
On 6/4/25 14:46, Baolin Wang wrote:
>> Baolin, please run stress-ng command that stresses minor anon page
>> faults in multiple threads and then run multiple bash scripts which cat
>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>> process is impacted by the parallel status readers versus without them.
>
> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
> --perf -t 1m' command, while simultaneously running the following
> scripts to read the /proc/pidof(stress-ng)/status for each thread.
How many of those scripts?
> From the following data, I did not observe any obvious impact of this
> patch on the stress-ng tests when repeatedly reading the
> /proc/pidof(stress-ng)/status.
>
> w/o patch
> stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles
> 59.767 B/sec
> stress-ng: info: [6891] 1,472,101,565,760 Instructions
> 22.033 B/sec (0.369 instr. per cycle)
> stress-ng: info: [6891] 36,287,456 Page Faults Total
> 0.543 M/sec
> stress-ng: info: [6891] 36,287,456 Page Faults Minor
> 0.543 M/sec
>
> w/ patch
> stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles
> 60.177 B/sec
> stress-ng: info: [6872] 1,484,856,150,976 Instructions
> 22.235 B/sec (0.369 instr. per cycle)
> stress-ng: info: [6872] 36,547,456 Page Faults Total
> 0.547 M/sec
> stress-ng: info: [6872] 36,547,456 Page Faults Minor
> 0.547 M/sec
>
> =========================
> #!/bin/bash
>
> # Get the PIDs of stress-ng processes
> PIDS=$(pgrep stress-ng)
>
> # Loop through each PID and monitor /proc/[pid]/status
> for PID in $PIDS; do
> while true; do
> cat /proc/$PID/status
> usleep 100000
Hm but this limits the reading to 10 per second? If we want to simulate an
adversary process, it should be without the sleeps I think?
> done &
> done
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-04 13:46 ` Vlastimil Babka
@ 2025-06-04 14:16 ` Baolin Wang
2025-06-04 14:27 ` Vlastimil Babka
2025-06-04 16:54 ` Shakeel Butt
0 siblings, 2 replies; 18+ messages in thread
From: Baolin Wang @ 2025-06-04 14:16 UTC (permalink / raw)
To: Vlastimil Babka, Shakeel Butt, Michal Hocko
Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel
On 2025/6/4 21:46, Vlastimil Babka wrote:
> On 6/4/25 14:46, Baolin Wang wrote:
>>> Baolin, please run stress-ng command that stresses minor anon page
>>> faults in multiple threads and then run multiple bash scripts which cat
>>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>>> process is impacted by the parallel status readers versus without them.
>>
>> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
>> --perf -t 1m' command, while simultaneously running the following
>> scripts to read the /proc/pidof(stress-ng)/status for each thread.
>
> How many of those scripts?
One script, but it will start 32 threads to read each stress-ng
thread's status interface.
>> From the following data, I did not observe any obvious impact of this
>> patch on the stress-ng tests when repeatedly reading the
>> /proc/pidof(stress-ng)/status.
>>
>> w/o patch
>> stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles
>> 59.767 B/sec
>> stress-ng: info: [6891] 1,472,101,565,760 Instructions
>> 22.033 B/sec (0.369 instr. per cycle)
>> stress-ng: info: [6891] 36,287,456 Page Faults Total
>> 0.543 M/sec
>> stress-ng: info: [6891] 36,287,456 Page Faults Minor
>> 0.543 M/sec
>>
>> w/ patch
>> stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles
>> 60.177 B/sec
>> stress-ng: info: [6872] 1,484,856,150,976 Instructions
>> 22.235 B/sec (0.369 instr. per cycle)
>> stress-ng: info: [6872] 36,547,456 Page Faults Total
>> 0.547 M/sec
>> stress-ng: info: [6872] 36,547,456 Page Faults Minor
>> 0.547 M/sec
>>
>> =========================
>> #!/bin/bash
>>
>> # Get the PIDs of stress-ng processes
>> PIDS=$(pgrep stress-ng)
>>
>> # Loop through each PID and monitor /proc/[pid]/status
>> for PID in $PIDS; do
>> while true; do
>> cat /proc/$PID/status
>> usleep 100000
>
> Hm but this limits the reading to 10 per second? If we want to simulate an
> adversary process, it should be without the sleeps I think?
OK. I dropped the usleep, and I still cannot see any obvious impact.
w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles          67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions        24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848]        39,529,792 Page Faults Total    0.605 M/sec
stress-ng: info: [6848]        39,529,792 Page Faults Minor    0.605 M/sec

w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles          68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions        24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485]        39,439,232 Page Faults Total    0.604 M/sec
stress-ng: info: [2485]        39,439,232 Page Faults Minor    0.604 M/sec
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-04 14:16 ` Baolin Wang
@ 2025-06-04 14:27 ` Vlastimil Babka
2025-06-04 16:54 ` Shakeel Butt
1 sibling, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2025-06-04 14:27 UTC (permalink / raw)
To: Baolin Wang, Shakeel Butt, Michal Hocko
Cc: Andrew Morton, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb,
donettom, aboorvad, sj, linux-mm, linux-fsdevel, linux-kernel
On 6/4/25 16:16, Baolin Wang wrote:
>>>
>>> # Get the PIDs of stress-ng processes
>>> PIDS=$(pgrep stress-ng)
>>>
>>> # Loop through each PID and monitor /proc/[pid]/status
>>> for PID in $PIDS; do
>>> while true; do
>>> cat /proc/$PID/status
>>> usleep 100000
>>
>> Hm but this limits the reading to 10 per second? If we want to simulate an
>> adversary process, it should be without the sleeps I think?
>
> OK. I drop the usleep, and I still can not see obvious impact.
Thanks, that's reassuring.
> w/o patch:
> stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles
> 67.327 B/sec
> stress-ng: info: [6848] 1,616,524,844,832 Instructions
> 24.740 B/sec (0.367 instr. per cycle)
> stress-ng: info: [6848] 39,529,792 Page Faults Total
> 0.605 M/sec
> stress-ng: info: [6848] 39,529,792 Page Faults Minor
> 0.605 M/sec
>
> w/patch:
> stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles
> 68.382 B/sec
> stress-ng: info: [2485] 1,615,101,503,296 Instructions
> 24.750 B/sec (0.362 instr. per cycle)
> stress-ng: info: [2485] 39,439,232 Page Faults Total
> 0.604 M/sec
> stress-ng: info: [2485] 39,439,232 Page Faults Minor
> 0.604 M/sec
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-04 14:16 ` Baolin Wang
2025-06-04 14:27 ` Vlastimil Babka
@ 2025-06-04 16:54 ` Shakeel Butt
2025-06-05 0:48 ` Baolin Wang
1 sibling, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-06-04 16:54 UTC (permalink / raw)
To: Baolin Wang
Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, david,
lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
sj, linux-mm, linux-fsdevel, linux-kernel
On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
>
>
> On 2025/6/4 21:46, Vlastimil Babka wrote:
> > On 6/4/25 14:46, Baolin Wang wrote:
> > > > Baolin, please run stress-ng command that stresses minor anon page
> > > > faults in multiple threads and then run multiple bash scripts which cat
> > > > /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> > > > process is impacted by the parallel status readers versus without them.
> > >
> > > Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
> > > --perf -t 1m' command, while simultaneously running the following
> > > scripts to read the /proc/pidof(stress-ng)/status for each thread.
> >
> > How many of those scripts?
>
> 1 script, but will start 32 threads to read each stress-ng thread's status
> interface.
>
> > > From the following data, I did not observe any obvious impact of this
> > > patch on the stress-ng tests when repeatedly reading the
> > > /proc/pidof(stress-ng)/status.
> > >
> > > w/o patch
> > > stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles
> > > 59.767 B/sec
> > > stress-ng: info: [6891] 1,472,101,565,760 Instructions
> > > 22.033 B/sec (0.369 instr. per cycle)
> > > stress-ng: info: [6891] 36,287,456 Page Faults Total
> > > 0.543 M/sec
> > > stress-ng: info: [6891] 36,287,456 Page Faults Minor
> > > 0.543 M/sec
> > >
> > > w/ patch
> > > stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles
> > > 60.177 B/sec
> > > stress-ng: info: [6872] 1,484,856,150,976 Instructions
> > > 22.235 B/sec (0.369 instr. per cycle)
> > > stress-ng: info: [6872] 36,547,456 Page Faults Total
> > > 0.547 M/sec
> > > stress-ng: info: [6872] 36,547,456 Page Faults Minor
> > > 0.547 M/sec
> > >
> > > =========================
> > > #!/bin/bash
> > >
> > > # Get the PIDs of stress-ng processes
> > > PIDS=$(pgrep stress-ng)
> > >
> > > # Loop through each PID and monitor /proc/[pid]/status
> > > for PID in $PIDS; do
> > > while true; do
> > > cat /proc/$PID/status
> > > usleep 100000
> >
> > Hm but this limits the reading to 10 per second? If we want to simulate an
> > adversary process, it should be without the sleeps I think?
>
> OK. I drop the usleep, and I still can not see obvious impact.
>
> w/o patch:
> stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles
> 67.327 B/sec
> stress-ng: info: [6848] 1,616,524,844,832 Instructions
> 24.740 B/sec (0.367 instr. per cycle)
> stress-ng: info: [6848] 39,529,792 Page Faults Total
> 0.605 M/sec
> stress-ng: info: [6848] 39,529,792 Page Faults Minor
> 0.605 M/sec
>
> w/patch:
> stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles
> 68.382 B/sec
> stress-ng: info: [2485] 1,615,101,503,296 Instructions
> 24.750 B/sec (0.362 instr. per cycle)
> stress-ng: info: [2485] 39,439,232 Page Faults Total
> 0.604 M/sec
> stress-ng: info: [2485] 39,439,232 Page Faults Minor
> 0.604 M/sec
Is the above with 32 non-sleeping parallel reader scripts?
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-04 16:54 ` Shakeel Butt
@ 2025-06-05 0:48 ` Baolin Wang
2025-06-05 6:32 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-05 0:48 UTC (permalink / raw)
To: Shakeel Butt
Cc: Vlastimil Babka, Michal Hocko, Andrew Morton, david,
lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
sj, linux-mm, linux-fsdevel, linux-kernel
On 2025/6/5 00:54, Shakeel Butt wrote:
> On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/6/4 21:46, Vlastimil Babka wrote:
>>> On 6/4/25 14:46, Baolin Wang wrote:
>>>>> Baolin, please run stress-ng command that stresses minor anon page
>>>>> faults in multiple threads and then run multiple bash scripts which cat
>>>>> /proc/pidof(stress-ng)/status. That should be how much the stress-ng
>>>>> process is impacted by the parallel status readers versus without them.
>>>>
>>>> Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
>>>> --perf -t 1m' command, while simultaneously running the following
>>>> scripts to read the /proc/pidof(stress-ng)/status for each thread.
>>>
>>> How many of those scripts?
>>
>> 1 script, but will start 32 threads to read each stress-ng thread's status
>> interface.
>>
>>>> From the following data, I did not observe any obvious impact of this
>>>> patch on the stress-ng tests when repeatedly reading the
>>>> /proc/pidof(stress-ng)/status.
>>>>
>>>> w/o patch
>>>> stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles
>>>> 59.767 B/sec
>>>> stress-ng: info: [6891] 1,472,101,565,760 Instructions
>>>> 22.033 B/sec (0.369 instr. per cycle)
>>>> stress-ng: info: [6891] 36,287,456 Page Faults Total
>>>> 0.543 M/sec
>>>> stress-ng: info: [6891] 36,287,456 Page Faults Minor
>>>> 0.543 M/sec
>>>>
>>>> w/ patch
>>>> stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles
>>>> 60.177 B/sec
>>>> stress-ng: info: [6872] 1,484,856,150,976 Instructions
>>>> 22.235 B/sec (0.369 instr. per cycle)
>>>> stress-ng: info: [6872] 36,547,456 Page Faults Total
>>>> 0.547 M/sec
>>>> stress-ng: info: [6872] 36,547,456 Page Faults Minor
>>>> 0.547 M/sec
>>>>
>>>> =========================
>>>> #!/bin/bash
>>>>
>>>> # Get the PIDs of stress-ng processes
>>>> PIDS=$(pgrep stress-ng)
>>>>
>>>> # Loop through each PID and monitor /proc/[pid]/status
>>>> for PID in $PIDS; do
>>>> while true; do
>>>> cat /proc/$PID/status
>>>> usleep 100000
>>>
>>> Hm but this limits the reading to 10 per second? If we want to simulate an
>>> adversary process, it should be without the sleeps I think?
>>
>> OK. I drop the usleep, and I still can not see obvious impact.
>>
>> w/o patch:
>> stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles
>> 67.327 B/sec
>> stress-ng: info: [6848] 1,616,524,844,832 Instructions
>> 24.740 B/sec (0.367 instr. per cycle)
>> stress-ng: info: [6848] 39,529,792 Page Faults Total
>> 0.605 M/sec
>> stress-ng: info: [6848] 39,529,792 Page Faults Minor
>> 0.605 M/sec
>>
>> w/patch:
>> stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles
>> 68.382 B/sec
>> stress-ng: info: [2485] 1,615,101,503,296 Instructions
>> 24.750 B/sec (0.362 instr. per cycle)
>> stress-ng: info: [2485] 39,439,232 Page Faults Total
>> 0.604 M/sec
>> stress-ng: info: [2485] 39,439,232 Page Faults Minor
>> 0.604 M/sec
>
> Is the above with 32 non-sleeping parallel reader scripts?
Yes.
* Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
2025-06-05 0:48 ` Baolin Wang
@ 2025-06-05 6:32 ` Michal Hocko
0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-06-05 6:32 UTC (permalink / raw)
To: Baolin Wang
Cc: Shakeel Butt, Vlastimil Babka, Andrew Morton, david,
lorenzo.stoakes, Liam.Howlett, rppt, surenb, donettom, aboorvad,
sj, linux-mm, linux-fsdevel, linux-kernel
On Thu 05-06-25 08:48:07, Baolin Wang wrote:
>
>
> On 2025/6/5 00:54, Shakeel Butt wrote:
> > On Wed, Jun 04, 2025 at 10:16:18PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 2025/6/4 21:46, Vlastimil Babka wrote:
> > > > On 6/4/25 14:46, Baolin Wang wrote:
> > > > > > Baolin, please run stress-ng command that stresses minor anon page
> > > > > > faults in multiple threads and then run multiple bash scripts which cat
> > > > > > /proc/pidof(stress-ng)/status. That should be how much the stress-ng
> > > > > > process is impacted by the parallel status readers versus without them.
> > > > >
> > > > > Sure. Thanks Shakeel. I run the stress-ng with the 'stress-ng --fault 32
> > > > > --perf -t 1m' command, while simultaneously running the following
> > > > > scripts to read the /proc/pidof(stress-ng)/status for each thread.
> > > >
> > > > How many of those scripts?
> > >
> > > 1 script, but will start 32 threads to read each stress-ng thread's status
> > > interface.
> > >
> > > > > From the following data, I did not observe any obvious impact of this
> > > > > patch on the stress-ng tests when repeatedly reading the
> > > > > /proc/pidof(stress-ng)/status.
> > > > >
> > > > > w/o patch
> > > > > stress-ng: info: [6891] 3,993,235,331,584 CPU Cycles
> > > > > 59.767 B/sec
> > > > > stress-ng: info: [6891] 1,472,101,565,760 Instructions
> > > > > 22.033 B/sec (0.369 instr. per cycle)
> > > > > stress-ng: info: [6891] 36,287,456 Page Faults Total
> > > > > 0.543 M/sec
> > > > > stress-ng: info: [6891] 36,287,456 Page Faults Minor
> > > > > 0.543 M/sec
> > > > >
> > > > > w/ patch
> > > > > stress-ng: info: [6872] 4,018,592,975,968 CPU Cycles
> > > > > 60.177 B/sec
> > > > > stress-ng: info: [6872] 1,484,856,150,976 Instructions
> > > > > 22.235 B/sec (0.369 instr. per cycle)
> > > > > stress-ng: info: [6872] 36,547,456 Page Faults Total
> > > > > 0.547 M/sec
> > > > > stress-ng: info: [6872] 36,547,456 Page Faults Minor
> > > > > 0.547 M/sec
> > > > >
> > > > > =========================
> > > > > #!/bin/bash
> > > > >
> > > > > # Get the PIDs of stress-ng processes
> > > > > PIDS=$(pgrep stress-ng)
> > > > >
> > > > > # Loop through each PID and monitor /proc/[pid]/status
> > > > > for PID in $PIDS; do
> > > > > while true; do
> > > > > cat /proc/$PID/status
> > > > > usleep 100000
> > > >
> > > > Hm but this limits the reading to 10 per second? If we want to simulate an
> > > > adversary process, it should be without the sleeps I think?
> > >
> > > OK. I drop the usleep, and I still can not see obvious impact.
> > >
> > > w/o patch:
> > > stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles
> > > 67.327 B/sec
> > > stress-ng: info: [6848] 1,616,524,844,832 Instructions
> > > 24.740 B/sec (0.367 instr. per cycle)
> > > stress-ng: info: [6848] 39,529,792 Page Faults Total
> > > 0.605 M/sec
> > > stress-ng: info: [6848] 39,529,792 Page Faults Minor
> > > 0.605 M/sec
> > >
> > > w/patch:
> > > stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles
> > > 68.382 B/sec
> > > stress-ng: info: [2485] 1,615,101,503,296 Instructions
> > > 24.750 B/sec (0.362 instr. per cycle)
> > > stress-ng: info: [2485] 39,439,232 Page Faults Total
> > > 0.604 M/sec
> > > stress-ng: info: [2485] 39,439,232 Page Faults Minor
> > > 0.604 M/sec
> >
> > Is the above with 32 non-sleeping parallel reader scripts?
>
> Yes.
Thanks, this seems much more representative. Please update the changelog
with this. With that, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
--
Michal Hocko
SUSE Labs
Thread overview: 18+ messages
2025-05-24 1:59 [PATCH] mm: fix the inaccurate memory statistics issue for users Baolin Wang
2025-05-30 3:53 ` Andrew Morton
2025-05-30 13:39 ` Michal Hocko
2025-05-30 23:00 ` Andrew Morton
2025-06-03 8:08 ` Baolin Wang
2025-06-03 8:15 ` Michal Hocko
2025-06-03 8:32 ` Baolin Wang
2025-06-03 10:28 ` Michal Hocko
2025-06-03 14:22 ` Baolin Wang
2025-06-03 14:48 ` Michal Hocko
2025-06-03 17:29 ` Shakeel Butt
2025-06-04 12:46 ` Baolin Wang
2025-06-04 13:46 ` Vlastimil Babka
2025-06-04 14:16 ` Baolin Wang
2025-06-04 14:27 ` Vlastimil Babka
2025-06-04 16:54 ` Shakeel Butt
2025-06-05 0:48 ` Baolin Wang
2025-06-05 6:32 ` Michal Hocko