* [PATCH 0/9] sched/psi: some optimization and extension
@ 2022-07-21 4:04 Chengming Zhou
[not found] ` <20220721040439.2651-1-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
` (3 more replies)
0 siblings, 4 replies; 56+ messages in thread
From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw)
To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap
Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou
Hi all,
This patch series are some optimization and extension for PSI.
patch 1/9 fix periodic aggregation shut off problem introduced by earlier
commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups").
patch 2/9 optimize task switch inside shared cgroups when in_memstall status
of prev task and next task are different.
patch 3-4 optimize and simplify PSI status tracking by don't change task
psi_flags when migrate CPU/cgroup.
patch 7-8 introduce new kernel cmdline parameter "psi_inner_cgroup=" to
configure whether or not to track PSI stall information for inner cgroups.
patch 9/9 introduce new PSI resource PSI_IRQ to track IRQ/SOFTIRQ pressure
stall information when CONFIG_IRQ_TIME_ACCOUNTING.
Performance test on Intel Xeon Platinum with 3 levels of cgroup, in which
run mmtests config-scheduler-perfpipe:
tip tip tip patched patched patched patched
default cgroup_disable=pressure IRQ_TIME_ACCOUNTING default psi_inner_cgroup=off PSI_IRQ PSI_IRQ + psi_inner_cgroup=off
Min Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
1st-qrtle Time 10.01 ( 0.00%) 9.15 ( 8.66%) 10.16 ( -1.45%) 9.72 ( 2.89%) 9.35 ( 6.61%) 10.20 ( -1.81%) 9.54 ( 4.77%)
2nd-qrtle Time 10.07 ( 0.00%) 9.25 ( 8.12%) 10.19 ( -1.21%) 9.79 ( 2.73%) 9.38 ( 6.78%) 10.24 ( -1.75%) 9.59 ( 4.68%)
3rd-qrtle Time 10.14 ( 0.00%) 9.30 ( 8.32%) 10.23 ( -0.88%) 9.84 ( 3.00%) 9.44 ( 6.92%) 10.27 ( -1.21%) 9.62 ( 5.18%)
Max-1 Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
Max-5 Time 9.89 ( 0.00%) 8.99 ( 9.12%) 10.04 ( -1.53%) 9.63 ( 2.58%) 9.27 ( 6.22%) 10.09 ( -2.04%) 9.45 ( 4.41%)
Max-10 Time 9.92 ( 0.00%) 9.09 ( 8.33%) 10.11 ( -1.97%) 9.67 ( 2.51%) 9.29 ( 6.29%) 10.15 ( -2.30%) 9.48 ( 4.46%)
Max-90 Time 10.20 ( 0.00%) 9.33 ( 8.53%) 10.33 ( -1.24%) 9.87 ( 3.29%) 9.49 ( 6.99%) 10.29 ( -0.85%) 9.66 ( 5.32%)
Max-95 Time 10.23 ( 0.00%) 9.34 ( 8.70%) 10.37 ( -1.39%) 9.94 ( 2.83%) 9.53 ( 6.88%) 10.30 ( -0.65%) 9.67 ( 5.51%)
Max-99 Time 10.23 ( 0.00%) 9.37 ( 8.43%) 10.40 ( -1.63%) 9.99 ( 2.41%) 9.76 ( 4.57%) 10.31 ( -0.74%) 9.69 ( 5.25%)
Max Time 10.34 ( 0.00%) 9.46 ( 8.50%) 10.43 ( -0.83%) 17.04 ( -64.80%) 9.79 ( 5.36%) 10.32 ( 0.20%) 9.71 ( 6.07%)
Amean Time 10.08 ( 0.00%) 9.23 * 8.39%* 10.21 * -1.33%* 10.03 ( 0.47%) 9.41 * 6.59%* 10.23 * -1.53%* 9.59 * 4.87%*
Thanks!
Chengming Zhou (9):
sched/psi: fix periodic aggregation shut off
sched/psi: optimize task switch inside shared cgroups again
sched/psi: move private helpers to sched/stats.h
sched/psi: don't change task psi_flags when migrate CPU/group
sched/psi: don't create cgroup PSI files when psi_disabled
sched/psi: save percpu memory when !psi_cgroups_enabled
sched/psi: cache parent psi_group to speed up groups iterate
sched/psi: add kernel cmdline parameter psi_inner_cgroup
sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
.../admin-guide/kernel-parameters.txt | 11 +
include/linux/psi.h | 5 +-
include/linux/psi_types.h | 9 +-
include/linux/sched.h | 3 -
kernel/cgroup/cgroup.c | 30 +++
kernel/sched/core.c | 2 +
kernel/sched/psi.c | 194 +++++++++++++-----
kernel/sched/stats.h | 71 ++++---
8 files changed, 232 insertions(+), 93 deletions(-)
--
2.36.1
^ permalink raw reply [flat|nested] 56+ messages in thread[parent not found: <20220721040439.2651-1-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* [PATCH 1/9] sched/psi: fix periodic aggregation shut off 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou We don't want to wake periodic aggregation work back up if the task change is the aggregation worker itself going to sleep, or we'll ping-pong forever. Previously, we would use psi_task_change() in psi_dequeue() when task going to sleep, so this check was put in psi_task_change(). But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") defer task sleep handling to psi_task_switch(), won't go through psi_task_change() anymore. So this patch move this check to psi_task_switch(). Note for defer sleep case, we should wake periodic avgs work for common ancestors groups, since those groups have next task sched_in. Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- kernel/sched/psi.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index a337f3e35997..c8a4e644cd2c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; - bool wake_clock = true; void *iter = NULL; u64 now; @@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set) psi_flags_change(task, clear, set); now = cpu_clock(cpu); - /* - * Periodic aggregation shuts off if there is a period of no - * task changes, so we wake it back up if necessary. However, - * don't do this if the task change is the aggregation worker - * itself going to sleep, or we'll ping-pong forever. - */ - if (unlikely((clear & TSK_RUNNING) && - (task->flags & PF_WQ_WORKER) && - wq_worker_last_func(task) == psi_avgs_work)) - wake_clock = false; while ((group = iterate_groups(task, &iter))) - psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_change(group, cpu, clear, set, now, true); } void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, if (prev->pid) { int clear = TSK_ONCPU, set = 0; + bool wake_clock = true; /* * When we're going to sleep, psi_dequeue() lets us @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, clear |= TSK_MEMSTALL_RUNNING; if (prev->in_iowait) set |= TSK_IOWAIT; + + /* + * Periodic aggregation shuts off if there is a period of no + * task changes, so we wake it back up if necessary. However, + * don't do this if the task change is the aggregation worker + * itself going to sleep, or we'll ping-pong forever. + */ + if (unlikely((prev->flags & PF_WQ_WORKER) && + wq_worker_last_func(prev) == psi_avgs_work)) + wake_clock = false; } psi_flags_change(prev, clear, set); iter = NULL; while ((group = iterate_groups(prev, &iter)) && group != common) - psi_group_change(group, cpu, clear, set, now, true); + psi_group_change(group, cpu, clear, set, now, wake_clock); /* * TSK_ONCPU is handled up to the common ancestor. If we're tasked -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 1/9] sched/psi: fix periodic aggregation shut off @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou We don't want to wake periodic aggregation work back up if the task change is the aggregation worker itself going to sleep, or we'll ping-pong forever. Previously, we would use psi_task_change() in psi_dequeue() when task going to sleep, so this check was put in psi_task_change(). But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") defer task sleep handling to psi_task_switch(), won't go through psi_task_change() anymore. So this patch move this check to psi_task_switch(). Note for defer sleep case, we should wake periodic avgs work for common ancestors groups, since those groups have next task sched_in. Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- kernel/sched/psi.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index a337f3e35997..c8a4e644cd2c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; - bool wake_clock = true; void *iter = NULL; u64 now; @@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set) psi_flags_change(task, clear, set); now = cpu_clock(cpu); - /* - * Periodic aggregation shuts off if there is a period of no - * task changes, so we wake it back up if necessary. However, - * don't do this if the task change is the aggregation worker - * itself going to sleep, or we'll ping-pong forever. - */ - if (unlikely((clear & TSK_RUNNING) && - (task->flags & PF_WQ_WORKER) && - wq_worker_last_func(task) == psi_avgs_work)) - wake_clock = false; while ((group = iterate_groups(task, &iter))) - psi_group_change(group, cpu, clear, set, now, wake_clock); + psi_group_change(group, cpu, clear, set, now, true); } void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, if (prev->pid) { int clear = TSK_ONCPU, set = 0; + bool wake_clock = true; /* * When we're going to sleep, psi_dequeue() lets us @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, clear |= TSK_MEMSTALL_RUNNING; if (prev->in_iowait) set |= TSK_IOWAIT; + + /* + * Periodic aggregation shuts off if there is a period of no + * task changes, so we wake it back up if necessary. However, + * don't do this if the task change is the aggregation worker + * itself going to sleep, or we'll ping-pong forever. + */ + if (unlikely((prev->flags & PF_WQ_WORKER) && + wq_worker_last_func(prev) == psi_avgs_work)) + wake_clock = false; } psi_flags_change(prev, clear, set); iter = NULL; while ((group = iterate_groups(prev, &iter)) && group != common) - psi_group_change(group, cpu, clear, set, now, true); + psi_group_change(group, cpu, clear, set, now, wake_clock); /* * TSK_ONCPU is handled up to the common ancestor. If we're tasked -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off 2022-07-21 4:04 ` Chengming Zhou (?) @ 2022-07-25 15:34 ` Johannes Weiner -1 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 15:34 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote: > We don't want to wake periodic aggregation work back up if the > task change is the aggregation worker itself going to sleep, or > we'll ping-pong forever. > > Previously, we would use psi_task_change() in psi_dequeue() when > task going to sleep, so this check was put in psi_task_change(). > > But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") > defer task sleep handling to psi_task_switch(), won't go through > psi_task_change() anymore. > > So this patch move this check to psi_task_switch(). Note for defer sleep > case, we should wake periodic avgs work for common ancestors groups, > since those groups have next task sched_in. > > Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Good catch! Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <20220721040439.2651-2-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off 2022-07-21 4:04 ` Chengming Zhou @ 2022-07-25 15:39 ` Johannes Weiner -1 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 15:39 UTC (permalink / raw) To: Chengming Zhou Cc: surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote: > @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > clear |= TSK_MEMSTALL_RUNNING; > if (prev->in_iowait) > set |= TSK_IOWAIT; > + > + /* > + * Periodic aggregation shuts off if there is a period of no > + * task changes, so we wake it back up if necessary. However, > + * don't do this if the task change is the aggregation worker > + * itself going to sleep, or we'll ping-pong forever. > + */ > + if (unlikely((prev->flags & PF_WQ_WORKER) && > + wq_worker_last_func(prev) == psi_avgs_work)) > + wake_clock = false; > } > > psi_flags_change(prev, clear, set); > > iter = NULL; > while ((group = iterate_groups(prev, &iter)) && group != common) > - psi_group_change(group, cpu, clear, set, now, true); > + psi_group_change(group, cpu, clear, set, now, wake_clock); > > /* > * TSK_ONCPU is handled up to the common ancestor. If we're tasked Wait, there is another psi_group_change() below this, which handles the clearing of TSK_RUNNING for common ancestors. We don't want to wake those either, so it needs s/true/wake_clock/ as well. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off @ 2022-07-25 15:39 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 15:39 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote: > @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > clear |= TSK_MEMSTALL_RUNNING; > if (prev->in_iowait) > set |= TSK_IOWAIT; > + > + /* > + * Periodic aggregation shuts off if there is a period of no > + * task changes, so we wake it back up if necessary. However, > + * don't do this if the task change is the aggregation worker > + * itself going to sleep, or we'll ping-pong forever. > + */ > + if (unlikely((prev->flags & PF_WQ_WORKER) && > + wq_worker_last_func(prev) == psi_avgs_work)) > + wake_clock = false; > } > > psi_flags_change(prev, clear, set); > > iter = NULL; > while ((group = iterate_groups(prev, &iter)) && group != common) > - psi_group_change(group, cpu, clear, set, now, true); > + psi_group_change(group, cpu, clear, set, now, wake_clock); > > /* > * TSK_ONCPU is handled up to the common ancestor. If we're tasked Wait, there is another psi_group_change() below this, which handles the clearing of TSK_RUNNING for common ancestors. We don't want to wake those either, so it needs s/true/wake_clock/ as well. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 1/9] sched/psi: fix periodic aggregation shut off 2022-07-25 15:39 ` Johannes Weiner (?) @ 2022-07-26 13:28 ` Chengming Zhou -1 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-26 13:28 UTC (permalink / raw) To: Johannes Weiner, Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/25 23:39, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:31PM +0800, Chengming Zhou wrote: >> @@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, >> clear |= TSK_MEMSTALL_RUNNING; >> if (prev->in_iowait) >> set |= TSK_IOWAIT; >> + >> + /* >> + * Periodic aggregation shuts off if there is a period of no >> + * task changes, so we wake it back up if necessary. However, >> + * don't do this if the task change is the aggregation worker >> + * itself going to sleep, or we'll ping-pong forever. >> + */ >> + if (unlikely((prev->flags & PF_WQ_WORKER) && >> + wq_worker_last_func(prev) == psi_avgs_work)) >> + wake_clock = false; >> } >> >> psi_flags_change(prev, clear, set); >> >> iter = NULL; >> while ((group = iterate_groups(prev, &iter)) && group != common) >> - psi_group_change(group, cpu, clear, set, now, true); >> + psi_group_change(group, cpu, clear, set, now, wake_clock); >> >> /* >> * TSK_ONCPU is handled up to the common ancestor. If we're tasked > > Wait, there is another psi_group_change() below this, which handles > the clearing of TSK_RUNNING for common ancestors. We don't want to > wake those either, so it needs s/true/wake_clock/ as well. Yes, I was wrong, will fix. Thanks! ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") defer prev task sleep handling to psi_task_switch(), so we don't need to clear and set TSK_ONCPU state for common cgroups. A | B / \ C D / \ prev next After that commit psi_task_switch() do: 1. psi_group_change(next, .set=TSK_ONCPU) for D 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C 3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A But there is a limitation "prev->psi_flags == next->psi_flags" that if not satisfied, will make this cgroups optimization unusable for both sleep switch or running switch cases. For example: prev->in_memstall != next->in_memstall when sleep switch: 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A prev->in_memstall != next->in_memstall when running switch: 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A 2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A The reason why this limitation exist is that we consider a group is PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive could run even if it were runnable. So when CPU curr changed from prev to next and their in_memstall status is different, we have to change PSI_MEM_FULL status for their common cgroups. This patch remove this limitation by making psi_group_change() change PSI_MEM_FULL status depend on CPU curr->in_memstall status. Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- kernel/sched/psi.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index c8a4e644cd2c..e04041d8251b 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu); if (next->pid) { - bool identical_state; - psi_flags_change(next, 0, TSK_ONCPU); /* * When switching between tasks that have an identical @@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * we reach the first common ancestor. Iterate @next's * ancestors only until we encounter @prev's ONCPU. */ - identical_state = prev->psi_flags == next->psi_flags; iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; } @@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * TSK_ONCPU is handled up to the common ancestor. If we're tasked * with dequeuing too, finish that for the rest of the hierarchy. */ - if (sleep) { + if (sleep || unlikely(prev->in_memstall != next->in_memstall)) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, true); -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") defer prev task sleep handling to psi_task_switch(), so we don't need to clear and set TSK_ONCPU state for common cgroups. A | B / \ C D / \ prev next After that commit psi_task_switch() do: 1. psi_group_change(next, .set=TSK_ONCPU) for D 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C 3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A But there is a limitation "prev->psi_flags == next->psi_flags" that if not satisfied, will make this cgroups optimization unusable for both sleep switch or running switch cases. For example: prev->in_memstall != next->in_memstall when sleep switch: 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A prev->in_memstall != next->in_memstall when running switch: 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A 2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A The reason why this limitation exist is that we consider a group is PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive could run even if it were runnable. So when CPU curr changed from prev to next and their in_memstall status is different, we have to change PSI_MEM_FULL status for their common cgroups. This patch remove this limitation by making psi_group_change() change PSI_MEM_FULL status depend on CPU curr->in_memstall status. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- kernel/sched/psi.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index c8a4e644cd2c..e04041d8251b 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu); if (next->pid) { - bool identical_state; - psi_flags_change(next, 0, TSK_ONCPU); /* * When switching between tasks that have an identical @@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * we reach the first common ancestor. Iterate @next's * ancestors only until we encounter @prev's ONCPU. */ - identical_state = prev->psi_flags == next->psi_flags; iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; } @@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * TSK_ONCPU is handled up to the common ancestor. If we're tasked * with dequeuing too, finish that for the rest of the hierarchy. */ - if (sleep) { + if (sleep || unlikely(prev->in_memstall != next->in_memstall)) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, true); -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 3/9] sched/psi: move private helpers to sched/stats.h 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou This patch move psi_task_change/psi_task_switch declarations out of PSI public header, since they are only needed for implementing the PSI stats tracking in sched/stats.h psi_task_switch is obvious, psi_task_change can't be public helper since it doesn't check psi_disabled static key. And there is no any user now, so put it in sched/stats.h too. Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- include/linux/psi.h | 4 ---- kernel/sched/stats.h | 4 ++++ 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/psi.h b/include/linux/psi.h index 89784763d19e..aa168a038242 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -18,10 +18,6 @@ extern struct psi_group psi_system; void psi_init(void); -void psi_task_change(struct task_struct *task, int clear, int set); -void psi_task_switch(struct task_struct *prev, struct task_struct *next, - bool sleep); - void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index baa839c1ba96..c39b467ece43 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se) } #ifdef CONFIG_PSI +void psi_task_change(struct task_struct *task, int clear, int set); +void psi_task_switch(struct task_struct *prev, struct task_struct *next, + bool sleep); + /* * PSI tracks state that persists across sleeps, such as iowaits and * memory stalls. As a result, it has to distinguish between sleeps, -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 3/9] sched/psi: move private helpers to sched/stats.h @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou This patch move psi_task_change/psi_task_switch declarations out of PSI public header, since they are only needed for implementing the PSI stats tracking in sched/stats.h psi_task_switch is obvious, psi_task_change can't be public helper since it doesn't check psi_disabled static key. And there is no any user now, so put it in sched/stats.h too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- include/linux/psi.h | 4 ---- kernel/sched/stats.h | 4 ++++ 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/psi.h b/include/linux/psi.h index 89784763d19e..aa168a038242 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -18,10 +18,6 @@ extern struct psi_group psi_system; void psi_init(void); -void psi_task_change(struct task_struct *task, int clear, int set); -void psi_task_switch(struct task_struct *prev, struct task_struct *next, - bool sleep); - void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index baa839c1ba96..c39b467ece43 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se) } #ifdef CONFIG_PSI +void psi_task_change(struct task_struct *task, int clear, int set); +void psi_task_switch(struct task_struct *prev, struct task_struct *next, + bool sleep); + /* * PSI tracks state that persists across sleeps, such as iowaits and * memory stalls. As a result, it has to distinguish between sleeps, -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 3/9] sched/psi: move private helpers to sched/stats.h 2022-07-21 4:04 ` Chengming Zhou (?) @ 2022-07-25 16:39 ` Johannes Weiner -1 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:39 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:33PM +0800, Chengming Zhou wrote: > This patch move psi_task_change/psi_task_switch declarations out of > PSI public header, since they are only needed for implementing the > PSI stats tracking in sched/stats.h > > psi_task_switch is obvious, psi_task_change can't be public helper > since it doesn't check psi_disabled static key. And there is no > any user now, so put it in sched/stats.h too. > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou The current code use psi_task_change() at every scheduling point, in which change task psi_flags then change all its psi_groups. So we have to heavily rely on the task scheduling states to calculate what to set and what to clear at every scheduling point, which make the PSI stats tracking code much complex and error prone. In fact, the task psi_flags only change at wakeup and sleep (except ONCPU state at switch), it doesn't change at all when migrate CPU/group. If we keep its psi_flags unchanged when migrate CPU/group, we can just use task->psi_flags to clear(migrate out) or set(migrate in), which will make PSI stats tracking much simplier and more efficient. Note: ENQUEUE_WAKEUP only means wakeup task from sleep state, don't include wakeup new task, so add psi_enqueue() in wake_up_new_task(). Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. before the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.034 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.210 [sec] 8.210600 usecs/op 121793 ops/sec 2. after the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.077 [sec] 8.077648 usecs/op 123798 ops/sec Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- include/linux/sched.h | 3 --- kernel/sched/core.c | 1 + kernel/sched/psi.c | 24 ++++++++++--------- kernel/sched/stats.h | 54 +++++++++++++++++++++---------------------- 4 files changed, 40 insertions(+), 42 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 88b8817b827d..20a94786cad8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -879,9 +879,6 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; -#ifdef CONFIG_PSI - unsigned sched_psi_wake_requeue:1; -#endif /* Force alignment to the next boundary: */ unsigned :0; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a463dbc92fcd..f5f2d3542b05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p) post_init_entity_util_avg(p); activate_task(rq, p, ENQUEUE_NOCLOCK); + psi_enqueue(p, true); trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index e04041d8251b..6ba159fe2a4f 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) task->psi_flags |= set; } -void psi_task_change(struct task_struct *task, int clear, int set) +void psi_change_groups(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; void *iter = NULL; - u64 now; + u64 now = cpu_clock(cpu); + + while ((group = iterate_groups(task, &iter))) + psi_group_change(group, cpu, clear, set, now, true); +} +void psi_task_change(struct task_struct *task, int clear, int set) +{ if (!task->pid) return; psi_flags_change(task, clear, set); - - now = cpu_clock(cpu); - - while ((group = iterate_groups(task, &iter))) - psi_group_change(group, cpu, clear, set, now, true); + psi_change_groups(task, clear, set); } void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) * pick_next_task() * rq_unlock() * rq_lock() - * psi_task_change() // old cgroup + * psi_change_groups() // old cgroup * task->cgroups = to - * psi_task_change() // new cgroup + * psi_change_groups() // new cgroup * rq_unlock() * rq_lock() * psi_sched_switch() // does deferred updates in new cgroup @@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) task_flags = task->psi_flags; if (task_flags) - psi_task_change(task, task_flags, 0); + psi_change_groups(task, task_flags, 0); /* See comment above */ rcu_assign_pointer(task->cgroups, to); if (task_flags) - psi_task_change(task, 0, task_flags); + psi_change_groups(task, 0, task_flags); task_rq_unlock(rq, task, &rf); } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index c39b467ece43..e930b8fa6253 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se) } #ifdef CONFIG_PSI +void psi_change_groups(struct task_struct *task, int clear, int set); void psi_task_change(struct task_struct *task, int clear, int set); void psi_task_switch(struct task_struct *prev, struct task_struct *next, bool sleep); @@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup) if (static_branch_likely(&psi_disabled)) return; - if (p->in_memstall) - set |= TSK_MEMSTALL_RUNNING; + if (!wakeup) { + if (p->psi_flags) + psi_change_groups(p, 0, p->psi_flags); + return; + } - if (!wakeup || p->sched_psi_wake_requeue) { - if (p->in_memstall) + /* + * wakeup (including wakeup migrate) need to change task psi_flags, + * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING. + * Since we clear task->psi_flags for wakeup migrated task, we need + * to check task->psi_flags to see what should be set and clear. + */ + if (unlikely(p->in_memstall)) { + set |= TSK_MEMSTALL_RUNNING; + if (!(p->psi_flags & TSK_MEMSTALL)) set |= TSK_MEMSTALL; - if (p->sched_psi_wake_requeue) - p->sched_psi_wake_requeue = 0; - } else { - if (p->in_iowait) - clear |= TSK_IOWAIT; } + if (p->psi_flags & TSK_IOWAIT) + clear |= TSK_IOWAIT; psi_task_change(p, clear, set); } static inline void psi_dequeue(struct task_struct *p, bool sleep) { - int clear = TSK_RUNNING; - if (static_branch_likely(&psi_disabled)) return; + if (!sleep) { + if (p->psi_flags) + psi_change_groups(p, p->psi_flags, 0); + return; + } + /* * A voluntary sleep is a dequeue followed by a task switch. To * avoid walking all ancestors twice, psi_task_switch() handles * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU. * Do nothing here. */ - if (sleep) - return; - - if (p->in_memstall) - clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING); - - psi_task_change(p, clear, 0); } static inline void psi_ttwu_dequeue(struct task_struct *p) @@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) /* * Is the task being migrated during a wakeup? Make sure to * deregister its sleep-persistent psi states from the old - * queue, and let psi_enqueue() know it has to requeue. + * queue. */ - if (unlikely(p->in_iowait || p->in_memstall)) { + if (unlikely(p->psi_flags)) { struct rq_flags rf; struct rq *rq; - int clear = 0; - - if (p->in_iowait) - clear |= TSK_IOWAIT; - if (p->in_memstall) - clear |= TSK_MEMSTALL; rq = __task_rq_lock(p, &rf); - psi_task_change(p, clear, 0); - p->sched_psi_wake_requeue = 1; + psi_task_change(p, p->psi_flags, 0); __task_rq_unlock(rq, &rf); } } -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou The current code use psi_task_change() at every scheduling point, in which change task psi_flags then change all its psi_groups. So we have to heavily rely on the task scheduling states to calculate what to set and what to clear at every scheduling point, which make the PSI stats tracking code much complex and error prone. In fact, the task psi_flags only change at wakeup and sleep (except ONCPU state at switch), it doesn't change at all when migrate CPU/group. If we keep its psi_flags unchanged when migrate CPU/group, we can just use task->psi_flags to clear(migrate out) or set(migrate in), which will make PSI stats tracking much simplier and more efficient. Note: ENQUEUE_WAKEUP only means wakeup task from sleep state, don't include wakeup new task, so add psi_enqueue() in wake_up_new_task(). Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. before the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.034 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.210 [sec] 8.210600 usecs/op 121793 ops/sec 2. after the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.077 [sec] 8.077648 usecs/op 123798 ops/sec Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- include/linux/sched.h | 3 --- kernel/sched/core.c | 1 + kernel/sched/psi.c | 24 ++++++++++--------- kernel/sched/stats.h | 54 +++++++++++++++++++++---------------------- 4 files changed, 40 insertions(+), 42 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 88b8817b827d..20a94786cad8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -879,9 +879,6 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; -#ifdef CONFIG_PSI - unsigned sched_psi_wake_requeue:1; -#endif /* Force alignment to the next boundary: */ unsigned :0; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a463dbc92fcd..f5f2d3542b05 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p) post_init_entity_util_avg(p); activate_task(rq, p, ENQUEUE_NOCLOCK); + psi_enqueue(p, true); trace_sched_wakeup_new(p); check_preempt_curr(rq, p, WF_FORK); #ifdef CONFIG_SMP diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index e04041d8251b..6ba159fe2a4f 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) task->psi_flags |= set; } -void psi_task_change(struct task_struct *task, int clear, int set) +void psi_change_groups(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; void *iter = NULL; - u64 now; + u64 now = cpu_clock(cpu); + + while ((group = iterate_groups(task, &iter))) + psi_group_change(group, cpu, clear, set, now, true); +} +void psi_task_change(struct task_struct *task, int clear, int set) +{ if (!task->pid) return; psi_flags_change(task, clear, set); - - now = cpu_clock(cpu); - - while ((group = iterate_groups(task, &iter))) - psi_group_change(group, cpu, clear, set, now, true); + psi_change_groups(task, clear, set); } void psi_task_switch(struct task_struct *prev, struct task_struct *next, @@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) * pick_next_task() * rq_unlock() * rq_lock() - * psi_task_change() // old cgroup + * psi_change_groups() // old cgroup * task->cgroups = to - * psi_task_change() // new cgroup + * psi_change_groups() // new cgroup * rq_unlock() * rq_lock() * psi_sched_switch() // does deferred updates in new cgroup @@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) task_flags = task->psi_flags; if (task_flags) - psi_task_change(task, task_flags, 0); + psi_change_groups(task, task_flags, 0); /* See comment above */ rcu_assign_pointer(task->cgroups, to); if (task_flags) - psi_task_change(task, 0, task_flags); + psi_change_groups(task, 0, task_flags); task_rq_unlock(rq, task, &rf); } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index c39b467ece43..e930b8fa6253 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se) } #ifdef CONFIG_PSI +void psi_change_groups(struct task_struct *task, int clear, int set); void psi_task_change(struct task_struct *task, int clear, int set); void psi_task_switch(struct task_struct *prev, struct task_struct *next, bool sleep); @@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup) if (static_branch_likely(&psi_disabled)) return; - if (p->in_memstall) - set |= TSK_MEMSTALL_RUNNING; + if (!wakeup) { + if (p->psi_flags) + psi_change_groups(p, 0, p->psi_flags); + return; + } - if (!wakeup || p->sched_psi_wake_requeue) { - if (p->in_memstall) + /* + * wakeup (including wakeup migrate) need to change task psi_flags, + * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING. + * Since we clear task->psi_flags for wakeup migrated task, we need + * to check task->psi_flags to see what should be set and clear. + */ + if (unlikely(p->in_memstall)) { + set |= TSK_MEMSTALL_RUNNING; + if (!(p->psi_flags & TSK_MEMSTALL)) set |= TSK_MEMSTALL; - if (p->sched_psi_wake_requeue) - p->sched_psi_wake_requeue = 0; - } else { - if (p->in_iowait) - clear |= TSK_IOWAIT; } + if (p->psi_flags & TSK_IOWAIT) + clear |= TSK_IOWAIT; psi_task_change(p, clear, set); } static inline void psi_dequeue(struct task_struct *p, bool sleep) { - int clear = TSK_RUNNING; - if (static_branch_likely(&psi_disabled)) return; + if (!sleep) { + if (p->psi_flags) + psi_change_groups(p, p->psi_flags, 0); + return; + } + /* * A voluntary sleep is a dequeue followed by a task switch. To * avoid walking all ancestors twice, psi_task_switch() handles * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU. * Do nothing here. */ - if (sleep) - return; - - if (p->in_memstall) - clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING); - - psi_task_change(p, clear, 0); } static inline void psi_ttwu_dequeue(struct task_struct *p) @@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) /* * Is the task being migrated during a wakeup? Make sure to * deregister its sleep-persistent psi states from the old - * queue, and let psi_enqueue() know it has to requeue. + * queue. */ - if (unlikely(p->in_iowait || p->in_memstall)) { + if (unlikely(p->psi_flags)) { struct rq_flags rf; struct rq *rq; - int clear = 0; - - if (p->in_iowait) - clear |= TSK_IOWAIT; - if (p->in_memstall) - clear |= TSK_MEMSTALL; rq = __task_rq_lock(p, &rf); - psi_task_change(p, clear, 0); - p->sched_psi_wake_requeue = 1; + psi_task_change(p, p->psi_flags, 0); __task_rq_unlock(rq, &rf); } } -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou We use iterate_groups() to iterate each level psi_group to update PSI stats, which is a very hot path. In current code, iterate_groups() have to use multiple branches and cgroup_parent() to get parent psi_group for each level, which is not very efficient. This patch cache parent psi_group, only need to get psi_group of task itself first, then just use group->parent to iterate. And this patch is preparation for the following patch, in which we can configure PSI to only account for leaf cgroups and system-wide. Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. before the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.077 [sec] 8.077648 usecs/op 123798 ops/sec 2. after the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.758 [sec] 7.758354 usecs/op 128893 ops/sec Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- include/linux/psi_types.h | 2 ++ kernel/sched/psi.c | 48 ++++++++++++++++++++------------------- 2 files changed, 27 insertions(+), 23 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index c7fe7c089718..c124f7d186d0 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -147,6 +147,8 @@ struct psi_trigger { }; struct psi_group { + struct psi_group *parent; + /* Protects data used by the aggregator */ struct mutex avgs_lock; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index aa40bf888102..2228cbf3bdd3 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -758,30 +758,22 @@ static void psi_group_change(struct psi_group *group, int cpu, schedule_delayed_work(&group->avgs_work, PSI_FREQ); } -static struct psi_group *iterate_groups(struct task_struct *task, void **iter) +static inline struct psi_group *task_psi_group(struct task_struct *task) { - if (*iter == &psi_system) - return NULL; - #ifdef CONFIG_CGROUPS if (static_branch_likely(&psi_cgroups_enabled)) { - struct cgroup *cgroup = NULL; - - if (!*iter) - cgroup = task->cgroups->dfl_cgrp; - else - cgroup = cgroup_parent(*iter); + struct cgroup *cgroup = task_dfl_cgroup(task); - if (cgroup && cgroup_parent(cgroup)) { - *iter = cgroup; + if (cgroup && cgroup_parent(cgroup)) return cgroup_psi(cgroup); - } } #endif - *iter = &psi_system; return &psi_system; } +#define for_each_psi_group(group) \ + for (; group; group = group->parent) + static void psi_flags_change(struct task_struct *task, int clear, int set) { if (((task->psi_flags & set) || @@ -799,12 +791,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) void psi_change_groups(struct task_struct *task, int clear, int set) { + struct psi_group *group = task_psi_group(task); int cpu = task_cpu(task); - struct psi_group *group; - void *iter = NULL; u64 now = cpu_clock(cpu); - while ((group = iterate_groups(task, &iter))) + for_each_psi_group(group) psi_group_change(group, cpu, clear, set, now, true); } @@ -822,7 +813,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, { struct psi_group *group, *common = NULL; int cpu = task_cpu(prev); - void *iter; u64 now = cpu_clock(cpu); if (next->pid) { @@ -833,8 +823,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * we reach the first common ancestor. Iterate @next's * ancestors only until we encounter @prev's ONCPU. */ - iter = NULL; - while ((group = iterate_groups(next, &iter))) { + group = task_psi_group(next); + for_each_psi_group(group) { if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; @@ -874,9 +864,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_flags_change(prev, clear, set); - iter = NULL; - while ((group = iterate_groups(prev, &iter)) && group != common) + group = task_psi_group(prev); + for_each_psi_group(group) { + if (group == common) + break; psi_group_change(group, cpu, clear, set, now, wake_clock); + } /* * TSK_ONCPU is handled up to the common ancestor. If we're tasked @@ -884,7 +877,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, */ if (sleep || unlikely(prev->in_memstall != next->in_memstall)) { clear &= ~TSK_ONCPU; - for (; group; group = iterate_groups(prev, &iter)) + + for_each_psi_group(group) psi_group_change(group, cpu, clear, set, now, true); } } @@ -953,6 +947,8 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { + struct cgroup *parent; + if (!static_branch_likely(&psi_cgroups_enabled)) return 0; @@ -960,6 +956,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup) if (!cgroup->psi.pcpu) return -ENOMEM; group_init(&cgroup->psi); + + parent = cgroup_parent(cgroup); + if (parent && cgroup_parent(parent)) + cgroup->psi.parent = cgroup_psi(parent); + else + cgroup->psi.parent = &psi_system; return 0; } -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou We use iterate_groups() to iterate each level psi_group to update PSI stats, which is a very hot path. In current code, iterate_groups() have to use multiple branches and cgroup_parent() to get parent psi_group for each level, which is not very efficient. This patch cache parent psi_group, only need to get psi_group of task itself first, then just use group->parent to iterate. And this patch is preparation for the following patch, in which we can configure PSI to only account for leaf cgroups and system-wide. Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. before the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 8.077 [sec] 8.077648 usecs/op 123798 ops/sec 2. after the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.758 [sec] 7.758354 usecs/op 128893 ops/sec Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- include/linux/psi_types.h | 2 ++ kernel/sched/psi.c | 48 ++++++++++++++++++++------------------- 2 files changed, 27 insertions(+), 23 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index c7fe7c089718..c124f7d186d0 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -147,6 +147,8 @@ struct psi_trigger { }; struct psi_group { + struct psi_group *parent; + /* Protects data used by the aggregator */ struct mutex avgs_lock; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index aa40bf888102..2228cbf3bdd3 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -758,30 +758,22 @@ static void psi_group_change(struct psi_group *group, int cpu, schedule_delayed_work(&group->avgs_work, PSI_FREQ); } -static struct psi_group *iterate_groups(struct task_struct *task, void **iter) +static inline struct psi_group *task_psi_group(struct task_struct *task) { - if (*iter == &psi_system) - return NULL; - #ifdef CONFIG_CGROUPS if (static_branch_likely(&psi_cgroups_enabled)) { - struct cgroup *cgroup = NULL; - - if (!*iter) - cgroup = task->cgroups->dfl_cgrp; - else - cgroup = cgroup_parent(*iter); + struct cgroup *cgroup = task_dfl_cgroup(task); - if (cgroup && cgroup_parent(cgroup)) { - *iter = cgroup; + if (cgroup && cgroup_parent(cgroup)) return cgroup_psi(cgroup); - } } #endif - *iter = &psi_system; return &psi_system; } +#define for_each_psi_group(group) \ + for (; group; group = group->parent) + static void psi_flags_change(struct task_struct *task, int clear, int set) { if (((task->psi_flags & set) || @@ -799,12 +791,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set) void psi_change_groups(struct task_struct *task, int clear, int set) { + struct psi_group *group = task_psi_group(task); int cpu = task_cpu(task); - struct psi_group *group; - void *iter = NULL; u64 now = cpu_clock(cpu); - while ((group = iterate_groups(task, &iter))) + for_each_psi_group(group) psi_group_change(group, cpu, clear, set, now, true); } @@ -822,7 +813,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, { struct psi_group *group, *common = NULL; int cpu = task_cpu(prev); - void *iter; u64 now = cpu_clock(cpu); if (next->pid) { @@ -833,8 +823,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, * we reach the first common ancestor. Iterate @next's * ancestors only until we encounter @prev's ONCPU. */ - iter = NULL; - while ((group = iterate_groups(next, &iter))) { + group = task_psi_group(next); + for_each_psi_group(group) { if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; @@ -874,9 +864,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_flags_change(prev, clear, set); - iter = NULL; - while ((group = iterate_groups(prev, &iter)) && group != common) + group = task_psi_group(prev); + for_each_psi_group(group) { + if (group == common) + break; psi_group_change(group, cpu, clear, set, now, wake_clock); + } /* * TSK_ONCPU is handled up to the common ancestor. If we're tasked @@ -884,7 +877,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, */ if (sleep || unlikely(prev->in_memstall != next->in_memstall)) { clear &= ~TSK_ONCPU; - for (; group; group = iterate_groups(prev, &iter)) + + for_each_psi_group(group) psi_group_change(group, cpu, clear, set, now, true); } } @@ -953,6 +947,8 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { + struct cgroup *parent; + if (!static_branch_likely(&psi_cgroups_enabled)) return 0; @@ -960,6 +956,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup) if (!cgroup->psi.pcpu) return -ENOMEM; group_init(&cgroup->psi); + + parent = cgroup_parent(cgroup); + if (parent && cgroup_parent(parent)) + cgroup->psi.parent = cgroup_psi(parent); + else + cgroup->psi.parent = &psi_system; return 0; } -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou ` (2 subsequent siblings) 3 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may case non-negligible overhead for some workloads when under deep level of the hierarchy. commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead. For our use case, we also want leaf cgroup PSI accounted for userspace adjustment on that cgroup, apart from only system-wide management. So this patch add kernel cmdline parameter "psi_inner_cgroup" to control whether or not to account for inner cgroups, which is default to true for compatibility. Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. default (psi_inner_cgroup=true) $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.758 [sec] 7.758354 usecs/op 128893 ops/sec 2. psi_inner_cgroup=false $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.309 [sec] 7.309436 usecs/op 136809 ops/sec Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++++ kernel/sched/psi.c | 11 ++++++++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 8090130b544b..6beef5b8bc36 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4419,6 +4419,12 @@ tracking. Format: <bool> + psi_inner_cgroup= + [KNL] Enable or disable pressure stall information + tracking for the inner cgroups. + Format: <bool> + default: enabled + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 2228cbf3bdd3..8d76920f47b3 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -147,12 +147,21 @@ static bool psi_enable; #else static bool psi_enable = true; #endif + +static bool psi_inner_cgroup __read_mostly = true; + static int __init setup_psi(char *str) { return kstrtobool(str, &psi_enable) == 0; } __setup("psi=", setup_psi); +static int __init setup_psi_inner_cgroup(char *str) +{ + return kstrtobool(str, &psi_inner_cgroup) == 0; +} +__setup("psi_inner_cgroup=", setup_psi_inner_cgroup); + /* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ @@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup) group_init(&cgroup->psi); parent = cgroup_parent(cgroup); - if (parent && cgroup_parent(parent)) + if (parent && cgroup_parent(parent) && psi_inner_cgroup) cgroup->psi.parent = cgroup_psi(parent); else cgroup->psi.parent = &psi_system; -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup @ 2022-07-21 4:04 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may case non-negligible overhead for some workloads when under deep level of the hierarchy. commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead. For our use case, we also want leaf cgroup PSI accounted for userspace adjustment on that cgroup, apart from only system-wide management. So this patch add kernel cmdline parameter "psi_inner_cgroup" to control whether or not to account for inner cgroups, which is default to true for compatibility. Performance test on Intel Xeon Platinum with 3 levels of cgroup: 1. default (psi_inner_cgroup=true) $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.758 [sec] 7.758354 usecs/op 128893 ops/sec 2. psi_inner_cgroup=false $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.032 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 7.309 [sec] 7.309436 usecs/op 136809 ops/sec Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++++ kernel/sched/psi.c | 11 ++++++++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 8090130b544b..6beef5b8bc36 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4419,6 +4419,12 @@ tracking. Format: <bool> + psi_inner_cgroup= + [KNL] Enable or disable pressure stall information + tracking for the inner cgroups. + Format: <bool> + default: enabled + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 2228cbf3bdd3..8d76920f47b3 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -147,12 +147,21 @@ static bool psi_enable; #else static bool psi_enable = true; #endif + +static bool psi_inner_cgroup __read_mostly = true; + static int __init setup_psi(char *str) { return kstrtobool(str, &psi_enable) == 0; } __setup("psi=", setup_psi); +static int __init setup_psi_inner_cgroup(char *str) +{ + return kstrtobool(str, &psi_inner_cgroup) == 0; +} +__setup("psi_inner_cgroup=", setup_psi_inner_cgroup); + /* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ @@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup) group_init(&cgroup->psi); parent = cgroup_parent(cgroup); - if (parent && cgroup_parent(parent)) + if (parent && cgroup_parent(parent) && psi_inner_cgroup) cgroup->psi.parent = cgroup_psi(parent); else cgroup->psi.parent = &psi_system; -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
[parent not found: <20220721040439.2651-9-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-07-21 4:04 ` Chengming Zhou @ 2022-07-25 16:52 ` Johannes Weiner -1 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:52 UTC (permalink / raw) To: Chengming Zhou Cc: surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: > PSI accounts stalls for each cgroup separately and aggregates it > at each level of the hierarchy. This may case non-negligible overhead > for some workloads when under deep level of the hierarchy. > > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") > make PSI to skip per-cgroup stall accounting, only account system-wide > to avoid this each level overhead. > > For our use case, we also want leaf cgroup PSI accounted for userspace > adjustment on that cgroup, apart from only system-wide management. I hear the overhead argument. But skipping accounting in intermediate levels is a bit odd and unprecedented in the cgroup interface. Once we do this, it's conceivable people would like to do the same thing for other stats and accounting, like for instance memory.stat. Tejun, what are your thoughts on this? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup @ 2022-07-25 16:52 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:52 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: > PSI accounts stalls for each cgroup separately and aggregates it > at each level of the hierarchy. This may case non-negligible overhead > for some workloads when under deep level of the hierarchy. > > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") > make PSI to skip per-cgroup stall accounting, only account system-wide > to avoid this each level overhead. > > For our use case, we also want leaf cgroup PSI accounted for userspace > adjustment on that cgroup, apart from only system-wide management. I hear the overhead argument. But skipping accounting in intermediate levels is a bit odd and unprecedented in the cgroup interface. Once we do this, it's conceivable people would like to do the same thing for other stats and accounting, like for instance memory.stat. Tejun, what are your thoughts on this? ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <Yt7KQc0nnOypB2b2-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [External] Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-07-25 16:52 ` Johannes Weiner @ 2022-07-26 13:38 ` Chengming Zhou -1 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-26 13:38 UTC (permalink / raw) To: Johannes Weiner Cc: surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On 2022/7/26 00:52, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: >> PSI accounts stalls for each cgroup separately and aggregates it >> at each level of the hierarchy. This may case non-negligible overhead >> for some workloads when under deep level of the hierarchy. >> >> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") >> make PSI to skip per-cgroup stall accounting, only account system-wide >> to avoid this each level overhead. >> >> For our use case, we also want leaf cgroup PSI accounted for userspace >> adjustment on that cgroup, apart from only system-wide management. > > I hear the overhead argument. But skipping accounting in intermediate > levels is a bit odd and unprecedented in the cgroup interface. Once we > do this, it's conceivable people would like to do the same thing for > other stats and accounting, like for instance memory.stat. Right, it's a bit odd... We don't use PSI stats in intermediate levels in our use case, but don't know what other use scenarios are. If they are useful for other people, this patch can be dropped. Thanks. > > Tejun, what are your thoughts on this? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [External] Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup @ 2022-07-26 13:38 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-26 13:38 UTC (permalink / raw) To: Johannes Weiner Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/26 00:52, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: >> PSI accounts stalls for each cgroup separately and aggregates it >> at each level of the hierarchy. This may case non-negligible overhead >> for some workloads when under deep level of the hierarchy. >> >> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") >> make PSI to skip per-cgroup stall accounting, only account system-wide >> to avoid this each level overhead. >> >> For our use case, we also want leaf cgroup PSI accounted for userspace >> adjustment on that cgroup, apart from only system-wide management. > > I hear the overhead argument. But skipping accounting in intermediate > levels is a bit odd and unprecedented in the cgroup interface. Once we > do this, it's conceivable people would like to do the same thing for > other stats and accounting, like for instance memory.stat. Right, it's a bit odd... We don't use PSI stats in intermediate levels in our use case, but don't know what other use scenarios are. If they are useful for other people, this patch can be dropped. Thanks. > > Tejun, what are your thoughts on this? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-07-25 16:52 ` Johannes Weiner (?) (?) @ 2022-07-26 17:54 ` Tejun Heo [not found] ` <YuAqWprKd6NsWs7C-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> -1 siblings, 1 reply; 56+ messages in thread From: Tejun Heo @ 2022-07-26 17:54 UTC (permalink / raw) To: Johannes Weiner Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups Hello, On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: > > PSI accounts stalls for each cgroup separately and aggregates it > > at each level of the hierarchy. This may case non-negligible overhead > > for some workloads when under deep level of the hierarchy. > > > > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") > > make PSI to skip per-cgroup stall accounting, only account system-wide > > to avoid this each level overhead. > > > > For our use case, we also want leaf cgroup PSI accounted for userspace > > adjustment on that cgroup, apart from only system-wide management. > > I hear the overhead argument. But skipping accounting in intermediate > levels is a bit odd and unprecedented in the cgroup interface. Once we > do this, it's conceivable people would like to do the same thing for > other stats and accounting, like for instance memory.stat. > > Tejun, what are your thoughts on this? Given that PSI requires on-the-spot recursive accumulation unlike other stats, it can add quite a bit of overhead, so I'm sympathetic to the argument because PSI can't be made cheaper by kernel being better (or at least we don't know how to yet). That said, "leaf-only" feels really hacky to me. My memory is hazy but there's nothing preventing any cgroup from being skipped over when updating PSI states, right? The state count propagation is recursive but it's each task's state being propagated upwards not the child cgroup's, so we can skip over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on any given cgroup without worrying about affecting others. Am I correct? Assuming the above isn't wrong, if we can figure out how we can re-enable it, which is more difficult as the counters need to be resynchronized with the current state, that'd be ideal. Then, we can just allow each cgroup to enable / disable PSI reporting dynamically as they see fit. Thanks. -- tejun ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <YuAqWprKd6NsWs7C-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-07-26 17:54 ` Tejun Heo @ 2022-08-03 12:17 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-08-03 12:17 UTC (permalink / raw) To: Tejun Heo, Johannes Weiner Cc: surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On 2022/7/27 01:54, Tejun Heo wrote: > Hello, > > On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote: >> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: >>> PSI accounts stalls for each cgroup separately and aggregates it >>> at each level of the hierarchy. This may case non-negligible overhead >>> for some workloads when under deep level of the hierarchy. >>> >>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") >>> make PSI to skip per-cgroup stall accounting, only account system-wide >>> to avoid this each level overhead. >>> >>> For our use case, we also want leaf cgroup PSI accounted for userspace >>> adjustment on that cgroup, apart from only system-wide management. >> >> I hear the overhead argument. But skipping accounting in intermediate >> levels is a bit odd and unprecedented in the cgroup interface. Once we >> do this, it's conceivable people would like to do the same thing for >> other stats and accounting, like for instance memory.stat. >> >> Tejun, what are your thoughts on this? > > Given that PSI requires on-the-spot recursive accumulation unlike other > stats, it can add quite a bit of overhead, so I'm sympathetic to the > argument because PSI can't be made cheaper by kernel being better (or at > least we don't know how to yet). > > That said, "leaf-only" feels really hacky to me. My memory is hazy but > there's nothing preventing any cgroup from being skipped over when updating > PSI states, right? The state count propagation is recursive but it's each > task's state being propagated upwards not the child cgroup's, so we can skip > over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on > any given cgroup without worrying about affecting others. Am I correct? Yes, I think it's correct. > > Assuming the above isn't wrong, if we can figure out how we can re-enable > it, which is more difficult as the counters need to be resynchronized with > the current state, that'd be ideal. Then, we can just allow each cgroup to > enable / disable PSI reporting dynamically as they see fit. This method is more fine-grained but more difficult like you said above. I think it may meet most needs to disable PSI stats in intermediate cgroups? Thanks! > > Thanks. > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup @ 2022-08-03 12:17 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-08-03 12:17 UTC (permalink / raw) To: Tejun Heo, Johannes Weiner Cc: surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/27 01:54, Tejun Heo wrote: > Hello, > > On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote: >> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote: >>> PSI accounts stalls for each cgroup separately and aggregates it >>> at each level of the hierarchy. This may case non-negligible overhead >>> for some workloads when under deep level of the hierarchy. >>> >>> commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") >>> make PSI to skip per-cgroup stall accounting, only account system-wide >>> to avoid this each level overhead. >>> >>> For our use case, we also want leaf cgroup PSI accounted for userspace >>> adjustment on that cgroup, apart from only system-wide management. >> >> I hear the overhead argument. But skipping accounting in intermediate >> levels is a bit odd and unprecedented in the cgroup interface. Once we >> do this, it's conceivable people would like to do the same thing for >> other stats and accounting, like for instance memory.stat. >> >> Tejun, what are your thoughts on this? > > Given that PSI requires on-the-spot recursive accumulation unlike other > stats, it can add quite a bit of overhead, so I'm sympathetic to the > argument because PSI can't be made cheaper by kernel being better (or at > least we don't know how to yet). > > That said, "leaf-only" feels really hacky to me. My memory is hazy but > there's nothing preventing any cgroup from being skipped over when updating > PSI states, right? The state count propagation is recursive but it's each > task's state being propagated upwards not the child cgroup's, so we can skip > over any cgroup arbitrarily. ie. we can at least turn off PSI reporting on > any given cgroup without worrying about affecting others. Am I correct? Yes, I think it's correct. > > Assuming the above isn't wrong, if we can figure out how we can re-enable > it, which is more difficult as the counters need to be resynchronized with > the current state, that'd be ideal. Then, we can just allow each cgroup to > enable / disable PSI reporting dynamically as they see fit. This method is more fine-grained but more difficult like you said above. I think it may meet most needs to disable PSI stats in intermediate cgroups? Thanks! > > Thanks. > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-03 12:17 ` Chengming Zhou (?) @ 2022-08-03 17:58 ` Tejun Heo 2022-08-03 19:22 ` Johannes Weiner 2022-08-04 2:02 ` Chengming Zhou -1 siblings, 2 replies; 56+ messages in thread From: Tejun Heo @ 2022-08-03 17:58 UTC (permalink / raw) To: Chengming Zhou Cc: Johannes Weiner, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups Hello, On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: > > Assuming the above isn't wrong, if we can figure out how we can re-enable > > it, which is more difficult as the counters need to be resynchronized with > > the current state, that'd be ideal. Then, we can just allow each cgroup to > > enable / disable PSI reporting dynamically as they see fit. > > This method is more fine-grained but more difficult like you said above. > I think it may meet most needs to disable PSI stats in intermediate cgroups? So, I'm not necessarily against implementing something easier but we at least wanna get the interface right, so that if we decide to do the full thing later we can easily expand on the existing interface. ie. let's please not be too hacky. I don't think it'd be that difficult to implement per-cgroup disable-only operation that we can later expand to allow re-enabling, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-03 17:58 ` Tejun Heo @ 2022-08-03 19:22 ` Johannes Weiner 2022-08-03 19:48 ` Tejun Heo 2022-08-04 13:51 ` Chengming Zhou 2022-08-04 2:02 ` Chengming Zhou 1 sibling, 2 replies; 56+ messages in thread From: Johannes Weiner @ 2022-08-03 19:22 UTC (permalink / raw) To: Tejun Heo Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote: > Hello, > > On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: > > > Assuming the above isn't wrong, if we can figure out how we can re-enable > > > it, which is more difficult as the counters need to be resynchronized with > > > the current state, that'd be ideal. Then, we can just allow each cgroup to > > > enable / disable PSI reporting dynamically as they see fit. > > > > This method is more fine-grained but more difficult like you said above. > > I think it may meet most needs to disable PSI stats in intermediate cgroups? > > So, I'm not necessarily against implementing something easier but we at > least wanna get the interface right, so that if we decide to do the full > thing later we can easily expand on the existing interface. ie. let's please > not be too hacky. I don't think it'd be that difficult to implement > per-cgroup disable-only operation that we can later expand to allow > re-enabling, right? It should be relatively straight-forward to disable and re-enable state aggregation, time tracking, averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps - should be okay, IMO. Where it gets trickier is also stopping the tracking of task counts in a cgroup. For re-enabling afterwards, we'd have to freeze scheduler and cgroup state and find all tasks of interest across all CPUs for the given cgroup to recreate the counts. I'm not quite sure whether that's feasible, and if so, whether it's worth the savings. It might be good to benchmark the two disabling steps independently. Maybe stopping aggregation while keeping task counts is good enough, and we can commit to a disable/re-enable interface from the start. Or maybe it's all in the cachelines and iteration, and stopping the aggregation while still writing task counts isn't saving much. In that case we'd have to look closer at reconstructing task counts, to see if later re-enabling is actually a practical option or whether a one-off kill switch is more realistic. Chengming, can you experiment with disabling: record_times(), the test_state() loop and state_mask construction, and the averaging worker - while keeping the groupc->tasks updates? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-03 19:22 ` Johannes Weiner @ 2022-08-03 19:48 ` Tejun Heo 2022-08-04 13:51 ` Chengming Zhou 1 sibling, 0 replies; 56+ messages in thread From: Tejun Heo @ 2022-08-03 19:48 UTC (permalink / raw) To: Johannes Weiner Cc: Chengming Zhou, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups Hello, On Wed, Aug 03, 2022 at 03:22:16PM -0400, Johannes Weiner wrote: > Where it gets trickier is also stopping the tracking of task counts in > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler > and cgroup state and find all tasks of interest across all CPUs for > the given cgroup to recreate the counts. I'm not quite sure whether > that's feasible, and if so, whether it's worth the savings. If this turns out to be necessary, I wonder whether we can just be opportunistic. ie. don't bother with walking all the tasks but only remember whether a task is accounted at a given level or not (this can be a bitmap which is allocated at cgroup attach type and in most caess will be pretty small). Then, maybe we can just start accounting them as they cycle through state transitions - we ignore the ones leaving states that they weren't accounted for and start remembering the new states they enter. Thanks. -- tejun ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-03 19:22 ` Johannes Weiner 2022-08-03 19:48 ` Tejun Heo @ 2022-08-04 13:51 ` Chengming Zhou [not found] ` <f8444db4-3235-d108-698a-6772e03a6b67-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> 1 sibling, 1 reply; 56+ messages in thread From: Chengming Zhou @ 2022-08-04 13:51 UTC (permalink / raw) To: Johannes Weiner, Tejun Heo Cc: surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/8/4 03:22, Johannes Weiner wrote: > On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote: >> Hello, >> >> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: >>>> Assuming the above isn't wrong, if we can figure out how we can re-enable >>>> it, which is more difficult as the counters need to be resynchronized with >>>> the current state, that'd be ideal. Then, we can just allow each cgroup to >>>> enable / disable PSI reporting dynamically as they see fit. >>> >>> This method is more fine-grained but more difficult like you said above. >>> I think it may meet most needs to disable PSI stats in intermediate cgroups? >> >> So, I'm not necessarily against implementing something easier but we at >> least wanna get the interface right, so that if we decide to do the full >> thing later we can easily expand on the existing interface. ie. let's please >> not be too hacky. I don't think it'd be that difficult to implement >> per-cgroup disable-only operation that we can later expand to allow >> re-enabling, right? > > It should be relatively straight-forward to disable and re-enable > state aggregation, time tracking, averaging on a per-cgroup level, if > we can live with losing history from while it was disabled. I.e. the > avgs will restart from 0, total= will have gaps - should be okay, IMO. > > Where it gets trickier is also stopping the tracking of task counts in > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler > and cgroup state and find all tasks of interest across all CPUs for > the given cgroup to recreate the counts. I'm not quite sure whether > that's feasible, and if so, whether it's worth the savings. > > It might be good to benchmark the two disabling steps independently. > Maybe stopping aggregation while keeping task counts is good enough, > and we can commit to a disable/re-enable interface from the start. > > Or maybe it's all in the cachelines and iteration, and stopping the > aggregation while still writing task counts isn't saving much. In that > case we'd have to look closer at reconstructing task counts, to see if > later re-enabling is actually a practical option or whether a one-off > kill switch is more realistic. > > Chengming, can you experiment with disabling: record_times(), the > test_state() loop and state_mask construction, and the averaging > worker - while keeping the groupc->tasks updates? Hello, I did this experiment today with disabling record_times(), test_state() loop and averaging worker, while only keeping groupc->tasks[] updates, the results look promising. mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup: perfpipe tip tip patched psi=off psi=on only groupc->tasks[] Min Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) 1st-qrtle Time 8.11 ( 0.00%) 8.94 ( -10.22%) 8.39 ( -3.46%) 2nd-qrtle Time 8.17 ( 0.00%) 9.02 ( -10.42%) 8.44 ( -3.37%) 3rd-qrtle Time 8.20 ( 0.00%) 9.08 ( -10.72%) 8.48 ( -3.43%) Max-1 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) Max-5 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) Max-10 Time 8.09 ( 0.00%) 8.89 ( -9.96%) 8.35 ( -3.22%) Max-90 Time 8.31 ( 0.00%) 9.13 ( -9.90%) 8.55 ( -2.95%) Max-95 Time 8.32 ( 0.00%) 9.14 ( -9.88%) 8.55 ( -2.81%) Max-99 Time 8.39 ( 0.00%) 9.26 ( -10.30%) 8.57 ( -2.09%) Max Time 8.56 ( 0.00%) 9.26 ( -8.23%) 8.72 ( -1.90%) Amean Time 8.19 ( 0.00%) 9.03 * -10.26%* 8.45 * -3.27%* Tejun suggested using a bitmap in task to remember whether the task is accounted at a given level or not, which I think also is a very good idea, but I haven't clearly figure out how to do it. The above performance test result looks good to me, so I think we can implement this per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start, and we can change to a better implementation if needed later? Thanks! ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <f8444db4-3235-d108-698a-6772e03a6b67-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-04 13:51 ` Chengming Zhou @ 2022-08-04 16:56 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-08-04 16:56 UTC (permalink / raw) To: Chengming Zhou Cc: Tejun Heo, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On Thu, Aug 04, 2022 at 09:51:31PM +0800, Chengming Zhou wrote: > On 2022/8/4 03:22, Johannes Weiner wrote: > > On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote: > >> Hello, > >> > >> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: > >>>> Assuming the above isn't wrong, if we can figure out how we can re-enable > >>>> it, which is more difficult as the counters need to be resynchronized with > >>>> the current state, that'd be ideal. Then, we can just allow each cgroup to > >>>> enable / disable PSI reporting dynamically as they see fit. > >>> > >>> This method is more fine-grained but more difficult like you said above. > >>> I think it may meet most needs to disable PSI stats in intermediate cgroups? > >> > >> So, I'm not necessarily against implementing something easier but we at > >> least wanna get the interface right, so that if we decide to do the full > >> thing later we can easily expand on the existing interface. ie. let's please > >> not be too hacky. I don't think it'd be that difficult to implement > >> per-cgroup disable-only operation that we can later expand to allow > >> re-enabling, right? > > > > It should be relatively straight-forward to disable and re-enable > > state aggregation, time tracking, averaging on a per-cgroup level, if > > we can live with losing history from while it was disabled. I.e. the > > avgs will restart from 0, total= will have gaps - should be okay, IMO. > > > > Where it gets trickier is also stopping the tracking of task counts in > > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler > > and cgroup state and find all tasks of interest across all CPUs for > > the given cgroup to recreate the counts. I'm not quite sure whether > > that's feasible, and if so, whether it's worth the savings. > > > > It might be good to benchmark the two disabling steps independently. > > Maybe stopping aggregation while keeping task counts is good enough, > > and we can commit to a disable/re-enable interface from the start. > > > > Or maybe it's all in the cachelines and iteration, and stopping the > > aggregation while still writing task counts isn't saving much. In that > > case we'd have to look closer at reconstructing task counts, to see if > > later re-enabling is actually a practical option or whether a one-off > > kill switch is more realistic. > > > > Chengming, can you experiment with disabling: record_times(), the > > test_state() loop and state_mask construction, and the averaging > > worker - while keeping the groupc->tasks updates? > > Hello, > > I did this experiment today with disabling record_times(), test_state() > loop and averaging worker, while only keeping groupc->tasks[] updates, > the results look promising. > > mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup: > > perfpipe > tip tip patched > psi=off psi=on only groupc->tasks[] > Min Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > 1st-qrtle Time 8.11 ( 0.00%) 8.94 ( -10.22%) 8.39 ( -3.46%) > 2nd-qrtle Time 8.17 ( 0.00%) 9.02 ( -10.42%) 8.44 ( -3.37%) > 3rd-qrtle Time 8.20 ( 0.00%) 9.08 ( -10.72%) 8.48 ( -3.43%) > Max-1 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > Max-5 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > Max-10 Time 8.09 ( 0.00%) 8.89 ( -9.96%) 8.35 ( -3.22%) > Max-90 Time 8.31 ( 0.00%) 9.13 ( -9.90%) 8.55 ( -2.95%) > Max-95 Time 8.32 ( 0.00%) 9.14 ( -9.88%) 8.55 ( -2.81%) > Max-99 Time 8.39 ( 0.00%) 9.26 ( -10.30%) 8.57 ( -2.09%) > Max Time 8.56 ( 0.00%) 9.26 ( -8.23%) 8.72 ( -1.90%) > Amean Time 8.19 ( 0.00%) 9.03 * -10.26%* 8.45 * -3.27%* Fantastic! > Tejun suggested using a bitmap in task to remember whether the task is accounted > at a given level or not, which I think also is a very good idea, but I haven't > clearly figure out how to do it. > > The above performance test result looks good to me, so I think we can implement this > per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start, > and we can change to a better implementation if needed later? Yes, that sounds good to me. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup @ 2022-08-04 16:56 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-08-04 16:56 UTC (permalink / raw) To: Chengming Zhou Cc: Tejun Heo, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Aug 04, 2022 at 09:51:31PM +0800, Chengming Zhou wrote: > On 2022/8/4 03:22, Johannes Weiner wrote: > > On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote: > >> Hello, > >> > >> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: > >>>> Assuming the above isn't wrong, if we can figure out how we can re-enable > >>>> it, which is more difficult as the counters need to be resynchronized with > >>>> the current state, that'd be ideal. Then, we can just allow each cgroup to > >>>> enable / disable PSI reporting dynamically as they see fit. > >>> > >>> This method is more fine-grained but more difficult like you said above. > >>> I think it may meet most needs to disable PSI stats in intermediate cgroups? > >> > >> So, I'm not necessarily against implementing something easier but we at > >> least wanna get the interface right, so that if we decide to do the full > >> thing later we can easily expand on the existing interface. ie. let's please > >> not be too hacky. I don't think it'd be that difficult to implement > >> per-cgroup disable-only operation that we can later expand to allow > >> re-enabling, right? > > > > It should be relatively straight-forward to disable and re-enable > > state aggregation, time tracking, averaging on a per-cgroup level, if > > we can live with losing history from while it was disabled. I.e. the > > avgs will restart from 0, total= will have gaps - should be okay, IMO. > > > > Where it gets trickier is also stopping the tracking of task counts in > > a cgroup. For re-enabling afterwards, we'd have to freeze scheduler > > and cgroup state and find all tasks of interest across all CPUs for > > the given cgroup to recreate the counts. I'm not quite sure whether > > that's feasible, and if so, whether it's worth the savings. > > > > It might be good to benchmark the two disabling steps independently. > > Maybe stopping aggregation while keeping task counts is good enough, > > and we can commit to a disable/re-enable interface from the start. > > > > Or maybe it's all in the cachelines and iteration, and stopping the > > aggregation while still writing task counts isn't saving much. In that > > case we'd have to look closer at reconstructing task counts, to see if > > later re-enabling is actually a practical option or whether a one-off > > kill switch is more realistic. > > > > Chengming, can you experiment with disabling: record_times(), the > > test_state() loop and state_mask construction, and the averaging > > worker - while keeping the groupc->tasks updates? > > Hello, > > I did this experiment today with disabling record_times(), test_state() > loop and averaging worker, while only keeping groupc->tasks[] updates, > the results look promising. > > mmtests/config-scheduler-perfpipe on Intel Xeon Platinum with 3 levels of cgroup: > > perfpipe > tip tip patched > psi=off psi=on only groupc->tasks[] > Min Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > 1st-qrtle Time 8.11 ( 0.00%) 8.94 ( -10.22%) 8.39 ( -3.46%) > 2nd-qrtle Time 8.17 ( 0.00%) 9.02 ( -10.42%) 8.44 ( -3.37%) > 3rd-qrtle Time 8.20 ( 0.00%) 9.08 ( -10.72%) 8.48 ( -3.43%) > Max-1 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > Max-5 Time 7.99 ( 0.00%) 8.86 ( -10.95%) 8.31 ( -4.08%) > Max-10 Time 8.09 ( 0.00%) 8.89 ( -9.96%) 8.35 ( -3.22%) > Max-90 Time 8.31 ( 0.00%) 9.13 ( -9.90%) 8.55 ( -2.95%) > Max-95 Time 8.32 ( 0.00%) 9.14 ( -9.88%) 8.55 ( -2.81%) > Max-99 Time 8.39 ( 0.00%) 9.26 ( -10.30%) 8.57 ( -2.09%) > Max Time 8.56 ( 0.00%) 9.26 ( -8.23%) 8.72 ( -1.90%) > Amean Time 8.19 ( 0.00%) 9.03 * -10.26%* 8.45 * -3.27%* Fantastic! > Tejun suggested using a bitmap in task to remember whether the task is accounted > at a given level or not, which I think also is a very good idea, but I haven't > clearly figure out how to do it. > > The above performance test result looks good to me, so I think we can implement this > per-cgroup "cgroup.psi" interface to disable/re-enable PSI stats from the start, > and we can change to a better implementation if needed later? Yes, that sounds good to me. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup 2022-08-03 17:58 ` Tejun Heo 2022-08-03 19:22 ` Johannes Weiner @ 2022-08-04 2:02 ` Chengming Zhou 1 sibling, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-08-04 2:02 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, surenb, mingo, peterz, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/8/4 01:58, Tejun Heo wrote: > Hello, > > On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote: >>> Assuming the above isn't wrong, if we can figure out how we can re-enable >>> it, which is more difficult as the counters need to be resynchronized with >>> the current state, that'd be ideal. Then, we can just allow each cgroup to >>> enable / disable PSI reporting dynamically as they see fit. >> >> This method is more fine-grained but more difficult like you said above. >> I think it may meet most needs to disable PSI stats in intermediate cgroups? > > So, I'm not necessarily against implementing something easier but we at > least wanna get the interface right, so that if we decide to do the full > thing later we can easily expand on the existing interface. ie. let's please > not be too hacky. I don't think it'd be that difficult to implement > per-cgroup disable-only operation that we can later expand to allow > re-enabling, right? Agree, the interface is important, per-cgroup disable-only operation maybe easier to implement. I will look into this more. Thanks! ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou [not found] ` <20220721040439.2651-1-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> @ 2022-07-21 4:04 ` Chengming Zhou [not found] ` <20220721040439.2651-6-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> 2022-07-21 4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou 3 siblings, 1 reply; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI can be configured to skip per-cgroup stall accounting. And doesn't expose PSI files in cgroup hierarchy. This patch do the same thing when psi_disabled. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- kernel/cgroup/cgroup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1779ccddb734..1424da7ed2c4 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3700,6 +3700,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of) bool cgroup_psi_enabled(void) { + if (static_branch_likely(&psi_disabled)) + return false; + return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0; } -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
[parent not found: <20220721040439.2651-6-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou @ 2022-07-25 16:41 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:41 UTC (permalink / raw) To: Chengming Zhou Cc: surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On Thu, Jul 21, 2022 at 12:04:35PM +0800, Chengming Zhou wrote: > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") > make PSI can be configured to skip per-cgroup stall accounting. And > doesn't expose PSI files in cgroup hierarchy. > > This patch do the same thing when psi_disabled. > > Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled @ 2022-07-25 16:41 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:41 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:35PM +0800, Chengming Zhou wrote: > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") > make PSI can be configured to skip per-cgroup stall accounting. And > doesn't expose PSI files in cgroup hierarchy. > > This patch do the same thing when psi_disabled. > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou [not found] ` <20220721040439.2651-1-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> 2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou 2022-07-25 16:47 ` Johannes Weiner 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou 3 siblings, 1 reply; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou We won't use cgroup psi_group when !psi_cgroups_enabled, so don't bother to alloc percpu memory and init for it. Also don't need to migrate task PSI stats between cgroups in cgroup_move_task(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- kernel/sched/psi.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 6ba159fe2a4f..aa40bf888102 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -205,6 +205,7 @@ void __init psi_init(void) { if (!psi_enable) { static_branch_enable(&psi_disabled); + static_branch_disable(&psi_cgroups_enabled); return; } @@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { - if (static_branch_likely(&psi_disabled)) + if (!static_branch_likely(&psi_cgroups_enabled)) return 0; cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); @@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup) void psi_cgroup_free(struct cgroup *cgroup) { - if (static_branch_likely(&psi_disabled)) + if (!static_branch_likely(&psi_cgroups_enabled)) return; cancel_delayed_work_sync(&cgroup->psi.avgs_work); @@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) struct rq_flags rf; struct rq *rq; - if (static_branch_likely(&psi_disabled)) { + if (!static_branch_likely(&psi_cgroups_enabled)) { /* * Lame to do this here, but the scheduler cannot be locked * from the outside, so we move cgroups from inside sched/. -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled 2022-07-21 4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou @ 2022-07-25 16:47 ` Johannes Weiner 0 siblings, 0 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 16:47 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:36PM +0800, Chengming Zhou wrote: > We won't use cgroup psi_group when !psi_cgroups_enabled, so don't > bother to alloc percpu memory and init for it. > > Also don't need to migrate task PSI stats between cgroups in > cgroup_move_task(). > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou ` (2 preceding siblings ...) 2022-07-21 4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou @ 2022-07-21 4:04 ` Chengming Zhou [not found] ` <20220721040439.2651-10-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> 2022-07-25 18:26 ` Johannes Weiner 3 siblings, 2 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-21 4:04 UTC (permalink / raw) To: hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. For performance impact consideration, this is enabled by default when CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline parameter "psi_irq=". Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> --- .../admin-guide/kernel-parameters.txt | 5 ++ include/linux/psi.h | 1 + include/linux/psi_types.h | 7 +- kernel/cgroup/cgroup.c | 27 +++++++ kernel/sched/core.c | 1 + kernel/sched/psi.c | 76 ++++++++++++++++++- kernel/sched/stats.h | 13 ++++ 7 files changed, 126 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6beef5b8bc36..1067dde299a0 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4425,6 +4425,11 @@ Format: <bool> default: enabled + psi_irq= [KNL] Enable or disable IRQ/SOFTIRQ pressure stall + information tracking. + Format: <bool> + default: enabled when CONFIG_IRQ_TIME_ACCOUNTING. + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/include/linux/psi.h b/include/linux/psi.h index aa168a038242..f5cf3e45d5a5 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -14,6 +14,7 @@ struct css_set; #ifdef CONFIG_PSI extern struct static_key_false psi_disabled; +extern struct static_key_true psi_irq_enabled; extern struct psi_group psi_system; void psi_init(void); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index c124f7d186d0..195f123b1cd1 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -47,7 +47,8 @@ enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, - NR_PSI_RESOURCES = 3, + PSI_IRQ, + NR_PSI_RESOURCES = 4, }; /* @@ -63,9 +64,11 @@ enum psi_states { PSI_MEM_FULL, PSI_CPU_SOME, PSI_CPU_FULL, + PSI_IRQ_SOME, + PSI_IRQ_FULL, /* Only per-CPU, to weigh the CPU in the global average: */ PSI_NONIDLE, - NR_PSI_STATES = 7, + NR_PSI_STATES = 9, }; enum psi_aggregators { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1424da7ed2c4..cf61df0ac892 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3683,6 +3683,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of, return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static int cgroup_irq_pressure_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; + + return psi_show(seq, psi, PSI_IRQ); +} + +static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ); +} +#endif + static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, poll_table *pt) { @@ -5079,6 +5096,16 @@ static struct cftype cgroup_base_files[] = { .poll = cgroup_pressure_poll, .release = cgroup_pressure_release, }, +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + { + .name = "irq.pressure", + .flags = CFTYPE_PRESSURE, + .seq_show = cgroup_irq_pressure_show, + .write = cgroup_irq_pressure_write, + .poll = cgroup_pressure_poll, + .release = cgroup_pressure_release, + }, +#endif #endif /* CONFIG_PSI */ { } /* terminate */ }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f5f2d3542b05..08637cfb7ed9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->prev_irq_time += irq_delta; delta -= irq_delta; + psi_account_irqtime(rq->curr, irq_delta); #endif #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8d76920f47b3..6a0894e28780 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -141,6 +141,7 @@ static int psi_bug __read_mostly; DEFINE_STATIC_KEY_FALSE(psi_disabled); DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled); +DEFINE_STATIC_KEY_TRUE(psi_irq_enabled); #ifdef CONFIG_PSI_DEFAULT_DISABLED static bool psi_enable; @@ -150,6 +151,12 @@ static bool psi_enable = true; static bool psi_inner_cgroup __read_mostly = true; +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static bool psi_irq_enable = true; +#else +static bool psi_irq_enable; +#endif + static int __init setup_psi(char *str) { return kstrtobool(str, &psi_enable) == 0; @@ -162,6 +169,12 @@ static int __init setup_psi_inner_cgroup(char *str) } __setup("psi_inner_cgroup=", setup_psi_inner_cgroup); +static int __init setup_psi_irq(char *str) +{ + return kstrtobool(str, &psi_irq_enable) == 0; +} +__setup("psi_irq=", setup_psi_irq); + /* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ @@ -215,12 +228,16 @@ void __init psi_init(void) if (!psi_enable) { static_branch_enable(&psi_disabled); static_branch_disable(&psi_cgroups_enabled); + static_branch_disable(&psi_irq_enabled); return; } if (!cgroup_psi_enabled()) static_branch_disable(&psi_cgroups_enabled); + if (!psi_irq_enable) + static_branch_disable(&psi_irq_enabled); + psi_period = jiffies_to_nsecs(PSI_FREQ); group_init(&psi_system); } @@ -893,6 +910,28 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, } } +void psi_groups_account_irqtime(struct task_struct *task, u32 delta) +{ + struct psi_group *group = task_psi_group(task); + int cpu = task_cpu(task); + u64 now = cpu_clock(cpu); + struct psi_group_cpu *groupc; + + for_each_psi_group(group) { + groupc = per_cpu_ptr(group->pcpu, cpu); + + write_seqcount_begin(&groupc->seq); + + record_times(groupc, now); + groupc->times[PSI_IRQ_FULL] += delta; + + write_seqcount_end(&groupc->seq); + + if (group->poll_states & (1 << PSI_IRQ_FULL)) + psi_schedule_poll_work(group, 1); + } +} + /** * psi_memstall_enter - mark the beginning of a memory stall section * @flags: flags to handle nested sections @@ -1069,7 +1108,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) group->avg_next_update = update_averages(group, now); mutex_unlock(&group->avgs_lock); - for (full = 0; full < 2; full++) { + for (full = (res == PSI_IRQ); full < 2; full++) { unsigned long avg[3] = { 0, }; u64 total = 0; int w; @@ -1111,9 +1150,12 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, else return ERR_PTR(-EINVAL); - if (state >= PSI_NONIDLE) + if (state >= PSI_NONIDLE || state == PSI_IRQ_SOME) return ERR_PTR(-EINVAL); + if (!static_branch_likely(&psi_irq_enabled) && state == PSI_IRQ_FULL) + return ERR_PTR(-EOPNOTSUPP); + if (window_us < WINDOW_MIN_US || window_us > WINDOW_MAX_US) return ERR_PTR(-EINVAL); @@ -1395,6 +1437,33 @@ static const struct proc_ops psi_cpu_proc_ops = { .proc_release = psi_fop_release, }; +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static int psi_irq_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_IRQ); +} + +static int psi_irq_open(struct inode *inode, struct file *file) +{ + return psi_open(file, psi_irq_show); +} + +static ssize_t psi_irq_write(struct file *file, const char __user *user_buf, + size_t nbytes, loff_t *ppos) +{ + return psi_write(file, user_buf, nbytes, PSI_IRQ); +} + +static const struct proc_ops psi_irq_proc_ops = { + .proc_open = psi_irq_open, + .proc_read = seq_read, + .proc_lseek = seq_lseek, + .proc_write = psi_irq_write, + .proc_poll = psi_fop_poll, + .proc_release = psi_fop_release, +}; +#endif + static int __init psi_proc_init(void) { if (psi_enable) { @@ -1402,6 +1471,9 @@ static int __init psi_proc_init(void) proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops); proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops); proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops); +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops); +#endif } return 0; } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index e930b8fa6253..10926cdaccc8 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set); void psi_task_change(struct task_struct *task, int clear, int set); void psi_task_switch(struct task_struct *prev, struct task_struct *next, bool sleep); +void psi_groups_account_irqtime(struct task_struct *task, u32 delta); /* * PSI tracks state that persists across sleeps, such as iowaits and @@ -196,6 +197,17 @@ static inline void psi_sched_switch(struct task_struct *prev, psi_task_switch(prev, next, sleep); } +static inline void psi_account_irqtime(struct task_struct *task, u32 delta) +{ + if (!static_branch_likely(&psi_irq_enabled)) + return; + + if (!task->pid) + return; + + psi_groups_account_irqtime(task, delta); +} + #else /* CONFIG_PSI */ static inline void psi_enqueue(struct task_struct *p, bool wakeup) {} static inline void psi_dequeue(struct task_struct *p, bool sleep) {} @@ -203,6 +215,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {} static inline void psi_sched_switch(struct task_struct *prev, struct task_struct *next, bool sleep) {} +static inline void psi_account_irqtime(struct task_struct *curr, u32 delta) {} #endif /* CONFIG_PSI */ #ifdef CONFIG_SCHED_INFO -- 2.36.1 ^ permalink raw reply related [flat|nested] 56+ messages in thread
[parent not found: <20220721040439.2651-10-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou @ 2022-07-21 10:00 ` kernel test robot 2022-07-25 18:26 ` Johannes Weiner 1 sibling, 0 replies; 56+ messages in thread From: kernel test robot @ 2022-07-21 10:00 UTC (permalink / raw) To: Chengming Zhou, hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: kbuild-all-hn68Rpc1hR1g9hUCZPvPmw, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou Hi Chengming, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7] [cannot apply to tj-cgroup/for-next next-20220720] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4 config: riscv-randconfig-s032-20220718 (https://download.01.org/0day-ci/archive/20220721/202207211726.ilPYe7AO-lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org/config) compiler: riscv64-linux-gcc (GCC) 12.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # apt-get install sparse # sparse version: v0.6.4-39-gce1a6720-dirty # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=riscv SHELL=/bin/bash kernel/sched/ If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> sparse warnings: (new ones prefixed by >>) >> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *task @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:711:31: sparse: expected struct task_struct *task kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:3592:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/core.c:3592:17: sparse: expected struct sched_domain *[assigned] sd kernel/sched/core.c:3592:17: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/core.c:3789:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:3789:28: sparse: expected struct task_struct const *p kernel/sched/core.c:3789:28: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:9089:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *push_task @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:9089:43: sparse: expected struct task_struct *push_task kernel/sched/core.c:9089:43: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:6848:17: sparse: struct task_struct * kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: struct task_struct * kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit kernel/sched/core.c:562:6: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock kernel/sched/core.c: note: in included file: kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit kernel/sched/core.c: note: in included file: kernel/sched/pelt.h:97:13: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/pelt.h:97:13: sparse: expected struct task_struct const *p kernel/sched/pelt.h:97:13: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression kernel/sched/core.c: note: in included file: kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/core.c:2158:38: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:2158:38: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:2158:38: sparse: struct task_struct const * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * vim +711 kernel/sched/core.c 675 676 /* 677 * RQ-clock updating methods: 678 */ 679 680 static void update_rq_clock_task(struct rq *rq, s64 delta) 681 { 682 /* 683 * In theory, the compile should just see 0 here, and optimize out the call 684 * to sched_rt_avg_update. But I don't trust it... 685 */ 686 s64 __maybe_unused steal = 0, irq_delta = 0; 687 688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; 690 691 /* 692 * Since irq_time is only updated on {soft,}irq_exit, we might run into 693 * this case when a previous update_rq_clock() happened inside a 694 * {soft,}irq region. 695 * 696 * When this happens, we stop ->clock_task and only update the 697 * prev_irq_time stamp to account for the part that fit, so that a next 698 * update will consume the rest. This ensures ->clock_task is 699 * monotonic. 700 * 701 * It does however cause some slight miss-attribution of {soft,}irq 702 * time, a more accurate solution would be to update the irq_time using 703 * the current rq->clock timestamp, except that would require using 704 * atomic ops. 705 */ 706 if (irq_delta > delta) 707 irq_delta = delta; 708 709 rq->prev_irq_time += irq_delta; 710 delta -= irq_delta; > 711 psi_account_irqtime(rq->curr, irq_delta); 712 #endif 713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING 714 if (static_key_false((¶virt_steal_rq_enabled))) { 715 steal = paravirt_steal_clock(cpu_of(rq)); 716 steal -= rq->prev_steal_time_rq; 717 718 if (unlikely(steal > delta)) 719 steal = delta; 720 721 rq->prev_steal_time_rq += steal; 722 delta -= steal; 723 } 724 #endif 725 726 rq->clock_task += delta; 727 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-21 10:00 ` kernel test robot 0 siblings, 0 replies; 56+ messages in thread From: kernel test robot @ 2022-07-21 10:00 UTC (permalink / raw) To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: kbuild-all, linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou Hi Chengming, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7] [cannot apply to tj-cgroup/for-next next-20220720] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4 config: riscv-randconfig-s032-20220718 (https://download.01.org/0day-ci/archive/20220721/202207211726.ilPYe7AO-lkp@intel.com/config) compiler: riscv64-linux-gcc (GCC) 12.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # apt-get install sparse # sparse version: v0.6.4-39-gce1a6720-dirty # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=riscv SHELL=/bin/bash kernel/sched/ If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp@intel.com> sparse warnings: (new ones prefixed by >>) >> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *task @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:711:31: sparse: expected struct task_struct *task kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:3592:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/core.c:3592:17: sparse: expected struct sched_domain *[assigned] sd kernel/sched/core.c:3592:17: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/core.c:3789:28: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:3789:28: sparse: expected struct task_struct const *p kernel/sched/core.c:3789:28: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:9089:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *push_task @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:9089:43: sparse: expected struct task_struct *push_task kernel/sched/core.c:9089:43: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:6848:17: sparse: struct task_struct * kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: struct task_struct * kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit kernel/sched/core.c:562:6: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock kernel/sched/core.c: note: in included file: kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit kernel/sched/sched.h:1580:9: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit kernel/sched/core.c: note: in included file: kernel/sched/pelt.h:97:13: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct const *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/pelt.h:97:13: sparse: expected struct task_struct const *p kernel/sched/pelt.h:97:13: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression kernel/sched/core.c: note: in included file: kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/core.c:2158:38: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:2158:38: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:2158:38: sparse: struct task_struct const * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * vim +711 kernel/sched/core.c 675 676 /* 677 * RQ-clock updating methods: 678 */ 679 680 static void update_rq_clock_task(struct rq *rq, s64 delta) 681 { 682 /* 683 * In theory, the compile should just see 0 here, and optimize out the call 684 * to sched_rt_avg_update. But I don't trust it... 685 */ 686 s64 __maybe_unused steal = 0, irq_delta = 0; 687 688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; 690 691 /* 692 * Since irq_time is only updated on {soft,}irq_exit, we might run into 693 * this case when a previous update_rq_clock() happened inside a 694 * {soft,}irq region. 695 * 696 * When this happens, we stop ->clock_task and only update the 697 * prev_irq_time stamp to account for the part that fit, so that a next 698 * update will consume the rest. This ensures ->clock_task is 699 * monotonic. 700 * 701 * It does however cause some slight miss-attribution of {soft,}irq 702 * time, a more accurate solution would be to update the irq_time using 703 * the current rq->clock timestamp, except that would require using 704 * atomic ops. 705 */ 706 if (irq_delta > delta) 707 irq_delta = delta; 708 709 rq->prev_irq_time += irq_delta; 710 delta -= irq_delta; > 711 psi_account_irqtime(rq->curr, irq_delta); 712 #endif 713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING 714 if (static_key_false((¶virt_steal_rq_enabled))) { 715 steal = paravirt_steal_clock(cpu_of(rq)); 716 steal -= rq->prev_steal_time_rq; 717 718 if (unlikely(steal > delta)) 719 steal = delta; 720 721 rq->prev_steal_time_rq += steal; 722 delta -= steal; 723 } 724 #endif 725 726 rq->clock_task += delta; 727 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou @ 2022-07-21 22:10 ` kernel test robot 2022-07-25 18:26 ` Johannes Weiner 1 sibling, 0 replies; 56+ messages in thread From: kernel test robot @ 2022-07-21 22:10 UTC (permalink / raw) To: Chengming Zhou, hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: kbuild-all-hn68Rpc1hR1g9hUCZPvPmw, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA, Chengming Zhou Hi Chengming, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7] [cannot apply to tj-cgroup/for-next next-20220721] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4 config: x86_64-randconfig-s022-20220718 (https://download.01.org/0day-ci/archive/20220722/202207220642.sbCB4Bcf-lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org/config) compiler: gcc-11 (Debian 11.3.0-3) 11.3.0 reproduce: # apt-get install sparse # sparse version: v0.6.4-39-gce1a6720-dirty # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189 # save the config file mkdir build_dir && cp config build_dir/.config make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> sparse warnings: (new ones prefixed by >>) >> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:711:31: sparse: expected struct task_struct *curr kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:781:48: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:781:48: sparse: expected struct task_struct *p kernel/sched/core.c:781:48: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:6848:17: sparse: struct task_struct * kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: struct task_struct * kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit kernel/sched/core.c:570:23: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock kernel/sched/core.c:624:36: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit kernel/sched/core.c:665:36: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit kernel/sched/core.c:781:11: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression kernel/sched/core.c: note: in included file: kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'ttwu_runnable' - wrong count at exit kernel/sched/core.c:4262:9: sparse: sparse: context imbalance in 'task_call_func' - wrong count at exit kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'wake_up_new_task' - wrong count at exit kernel/sched/core.c:5035:9: sparse: sparse: context imbalance in 'finish_task_switch' - wrong count at exit kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * vim +711 kernel/sched/core.c 675 676 /* 677 * RQ-clock updating methods: 678 */ 679 680 static void update_rq_clock_task(struct rq *rq, s64 delta) 681 { 682 /* 683 * In theory, the compile should just see 0 here, and optimize out the call 684 * to sched_rt_avg_update. But I don't trust it... 685 */ 686 s64 __maybe_unused steal = 0, irq_delta = 0; 687 688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; 690 691 /* 692 * Since irq_time is only updated on {soft,}irq_exit, we might run into 693 * this case when a previous update_rq_clock() happened inside a 694 * {soft,}irq region. 695 * 696 * When this happens, we stop ->clock_task and only update the 697 * prev_irq_time stamp to account for the part that fit, so that a next 698 * update will consume the rest. This ensures ->clock_task is 699 * monotonic. 700 * 701 * It does however cause some slight miss-attribution of {soft,}irq 702 * time, a more accurate solution would be to update the irq_time using 703 * the current rq->clock timestamp, except that would require using 704 * atomic ops. 705 */ 706 if (irq_delta > delta) 707 irq_delta = delta; 708 709 rq->prev_irq_time += irq_delta; 710 delta -= irq_delta; > 711 psi_account_irqtime(rq->curr, irq_delta); 712 #endif 713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING 714 if (static_key_false((¶virt_steal_rq_enabled))) { 715 steal = paravirt_steal_clock(cpu_of(rq)); 716 steal -= rq->prev_steal_time_rq; 717 718 if (unlikely(steal > delta)) 719 steal = delta; 720 721 rq->prev_steal_time_rq += steal; 722 delta -= steal; 723 } 724 #endif 725 726 rq->clock_task += delta; 727 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-21 22:10 ` kernel test robot 0 siblings, 0 replies; 56+ messages in thread From: kernel test robot @ 2022-07-21 22:10 UTC (permalink / raw) To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: kbuild-all, linux-doc, linux-kernel, songmuchun, cgroups, Chengming Zhou Hi Chengming, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on akpm-mm/mm-everything linus/master v5.19-rc7] [cannot apply to tj-cgroup/for-next next-20220721] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 401e4963bf45c800e3e9ea0d3a0289d738005fd4 config: x86_64-randconfig-s022-20220718 (https://download.01.org/0day-ci/archive/20220722/202207220642.sbCB4Bcf-lkp@intel.com/config) compiler: gcc-11 (Debian 11.3.0-3) 11.3.0 reproduce: # apt-get install sparse # sparse version: v0.6.4-39-gce1a6720-dirty # https://github.com/intel-lab-lkp/linux/commit/d14f2a9ff31fefc5b28a16addaa832dc80d84189 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Chengming-Zhou/sched-psi-some-optimization-and-extension/20220721-120833 git checkout d14f2a9ff31fefc5b28a16addaa832dc80d84189 # save the config file mkdir build_dir && cp config build_dir/.config make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp@intel.com> sparse warnings: (new ones prefixed by >>) >> kernel/sched/core.c:711:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:711:31: sparse: expected struct task_struct *curr kernel/sched/core.c:711:31: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:781:48: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:781:48: sparse: expected struct task_struct *p kernel/sched/core.c:781:48: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:1028:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:1028:38: sparse: expected struct task_struct *curr kernel/sched/core.c:1028:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:33: sparse: expected struct task_struct *p kernel/sched/core.c:2192:33: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:2192:68: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:2192:68: sparse: expected struct task_struct *tsk kernel/sched/core.c:2192:68: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:5376:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:5376:38: sparse: expected struct task_struct *curr kernel/sched/core.c:5376:38: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6322:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *prev @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:6322:14: sparse: expected struct task_struct *prev kernel/sched/core.c:6322:14: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:6848:17: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:6848:17: sparse: struct task_struct * kernel/sched/core.c:6848:17: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/core.c:7064:22: sparse: struct task_struct [noderef] __rcu * kernel/sched/core.c:7064:22: sparse: struct task_struct * kernel/sched/core.c:11121:25: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@ kernel/sched/core.c:11121:25: sparse: expected struct task_struct *p kernel/sched/core.c:11121:25: sparse: got struct task_struct [noderef] __rcu *curr kernel/sched/core.c:537:6: sparse: sparse: context imbalance in 'raw_spin_rq_lock_nested' - wrong count at exit kernel/sched/core.c:570:23: sparse: sparse: context imbalance in 'raw_spin_rq_trylock' - wrong count at exit kernel/sched/core.c:586:6: sparse: sparse: context imbalance in 'raw_spin_rq_unlock' - unexpected unlock kernel/sched/core.c:624:36: sparse: sparse: context imbalance in '__task_rq_lock' - wrong count at exit kernel/sched/core.c:665:36: sparse: sparse: context imbalance in 'task_rq_lock' - wrong count at exit kernel/sched/core.c:781:11: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2183:33: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2184:19: sparse: sparse: dereference of noderef expression kernel/sched/core.c:2185:18: sparse: sparse: dereference of noderef expression kernel/sched/core.c: note: in included file: kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'ttwu_runnable' - wrong count at exit kernel/sched/core.c:4262:9: sparse: sparse: context imbalance in 'task_call_func' - wrong count at exit kernel/sched/sched.h:1592:9: sparse: sparse: context imbalance in 'wake_up_new_task' - wrong count at exit kernel/sched/core.c:5035:9: sparse: sparse: context imbalance in 'finish_task_switch' - wrong count at exit kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * kernel/sched/sched.h:2210:9: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2210:9: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2210:9: sparse: struct task_struct * kernel/sched/sched.h:2053:25: sparse: sparse: incompatible types in comparison expression (different address spaces): kernel/sched/sched.h:2053:25: sparse: struct task_struct [noderef] __rcu * kernel/sched/sched.h:2053:25: sparse: struct task_struct * vim +711 kernel/sched/core.c 675 676 /* 677 * RQ-clock updating methods: 678 */ 679 680 static void update_rq_clock_task(struct rq *rq, s64 delta) 681 { 682 /* 683 * In theory, the compile should just see 0 here, and optimize out the call 684 * to sched_rt_avg_update. But I don't trust it... 685 */ 686 s64 __maybe_unused steal = 0, irq_delta = 0; 687 688 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 689 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; 690 691 /* 692 * Since irq_time is only updated on {soft,}irq_exit, we might run into 693 * this case when a previous update_rq_clock() happened inside a 694 * {soft,}irq region. 695 * 696 * When this happens, we stop ->clock_task and only update the 697 * prev_irq_time stamp to account for the part that fit, so that a next 698 * update will consume the rest. This ensures ->clock_task is 699 * monotonic. 700 * 701 * It does however cause some slight miss-attribution of {soft,}irq 702 * time, a more accurate solution would be to update the irq_time using 703 * the current rq->clock timestamp, except that would require using 704 * atomic ops. 705 */ 706 if (irq_delta > delta) 707 irq_delta = delta; 708 709 rq->prev_irq_time += irq_delta; 710 delta -= irq_delta; > 711 psi_account_irqtime(rq->curr, irq_delta); 712 #endif 713 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING 714 if (static_key_false((¶virt_steal_rq_enabled))) { 715 steal = paravirt_steal_clock(cpu_of(rq)); 716 steal -= rq->prev_steal_time_rq; 717 718 if (unlikely(steal > delta)) 719 steal = delta; 720 721 rq->prev_steal_time_rq += steal; 722 delta -= steal; 723 } 724 #endif 725 726 rq->clock_task += delta; 727 -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou @ 2022-07-22 3:30 ` Abel Wu 2022-07-25 18:26 ` Johannes Weiner 1 sibling, 0 replies; 56+ messages in thread From: Abel Wu @ 2022-07-22 3:30 UTC (permalink / raw) To: Chengming Zhou, hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA Hi Chengming, On 7/21/22 12:04 PM, Chengming Zhou Wrote: > Now PSI already tracked workload pressure stall information for > CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have > obvious impact on some workload productivity, such as web service > workload. > > When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time > from update_rq_clock_task(), in which we can record that delta > to CPU curr task's cgroups as PSI_IRQ_FULL status. The {soft,}irq affection should be equal to all the runnable tasks on that cpu, not only rq->curr. Further I think irqstall is per-cpu rather than per-cgroup. Abel ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-22 3:30 ` Abel Wu 0 siblings, 0 replies; 56+ messages in thread From: Abel Wu @ 2022-07-22 3:30 UTC (permalink / raw) To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups Hi Chengming, On 7/21/22 12:04 PM, Chengming Zhou Wrote: > Now PSI already tracked workload pressure stall information for > CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have > obvious impact on some workload productivity, such as web service > workload. > > When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time > from update_rq_clock_task(), in which we can record that delta > to CPU curr task's cgroups as PSI_IRQ_FULL status. The {soft,}irq affection should be equal to all the runnable tasks on that cpu, not only rq->curr. Further I think irqstall is per-cpu rather than per-cgroup. Abel ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <65d9f79b-be9b-e21e-0624-5c9f2cc0c0b2-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-22 3:30 ` Abel Wu @ 2022-07-22 6:13 ` Chengming Zhou -1 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-22 6:13 UTC (permalink / raw) To: Abel Wu, hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On 2022/7/22 11:30, Abel Wu wrote: > Hi Chengming, > > On 7/21/22 12:04 PM, Chengming Zhou Wrote: >> Now PSI already tracked workload pressure stall information for >> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >> obvious impact on some workload productivity, such as web service >> workload. >> >> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >> from update_rq_clock_task(), in which we can record that delta >> to CPU curr task's cgroups as PSI_IRQ_FULL status. > > The {soft,}irq affection should be equal to all the runnable tasks > on that cpu, not only rq->curr. Further I think irqstall is per-cpu > rather than per-cgroup. Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL. So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system. Thanks. > > Abel ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-22 6:13 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-22 6:13 UTC (permalink / raw) To: Abel Wu, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/22 11:30, Abel Wu wrote: > Hi Chengming, > > On 7/21/22 12:04 PM, Chengming Zhou Wrote: >> Now PSI already tracked workload pressure stall information for >> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >> obvious impact on some workload productivity, such as web service >> workload. >> >> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >> from update_rq_clock_task(), in which we can record that delta >> to CPU curr task's cgroups as PSI_IRQ_FULL status. > > The {soft,}irq affection should be equal to all the runnable tasks > on that cpu, not only rq->curr. Further I think irqstall is per-cpu > rather than per-cgroup. Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL. So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system. Thanks. > > Abel ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-22 6:13 ` Chengming Zhou (?) @ 2022-07-22 7:14 ` Abel Wu [not found] ` <5e5d41e2-5f89-8c52-11e5-0c55c5595a88-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> -1 siblings, 1 reply; 56+ messages in thread From: Abel Wu @ 2022-07-22 7:14 UTC (permalink / raw) To: Chengming Zhou, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups On 7/22/22 2:13 PM, Chengming Zhou Wrote: > On 2022/7/22 11:30, Abel Wu wrote: >> Hi Chengming, >> >> On 7/21/22 12:04 PM, Chengming Zhou Wrote: >>> Now PSI already tracked workload pressure stall information for >>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >>> obvious impact on some workload productivity, such as web service >>> workload. >>> >>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >>> from update_rq_clock_task(), in which we can record that delta >>> to CPU curr task's cgroups as PSI_IRQ_FULL status. >> >> The {soft,}irq affection should be equal to all the runnable tasks >> on that cpu, not only rq->curr. Further I think irqstall is per-cpu >> rather than per-cgroup. > > Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time > and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL. I don't think rq->curr pays for it if you mean consuming quota here. And it doesn't seem appropriate to let other groups treat it as cpu stall because the rq->curr is also the victim rather than the one causes stall (so it's different from rq->curr causing memstall and observed as cpustall by others). > > So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups > as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system. > ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <5e5d41e2-5f89-8c52-11e5-0c55c5595a88-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-22 7:14 ` Abel Wu @ 2022-07-22 7:33 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-22 7:33 UTC (permalink / raw) To: Abel Wu, hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On 2022/7/22 15:14, Abel Wu wrote: > On 7/22/22 2:13 PM, Chengming Zhou Wrote: >> On 2022/7/22 11:30, Abel Wu wrote: >>> Hi Chengming, >>> >>> On 7/21/22 12:04 PM, Chengming Zhou Wrote: >>>> Now PSI already tracked workload pressure stall information for >>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >>>> obvious impact on some workload productivity, such as web service >>>> workload. >>>> >>>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >>>> from update_rq_clock_task(), in which we can record that delta >>>> to CPU curr task's cgroups as PSI_IRQ_FULL status. >>> >>> The {soft,}irq affection should be equal to all the runnable tasks >>> on that cpu, not only rq->curr. Further I think irqstall is per-cpu >>> rather than per-cgroup. >> >> Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time >> and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL. > > I don't think rq->curr pays for it if you mean consuming quota here. Yes, it makes rq->curr's groups look more productive than it actually is, which are clearly different from other groups. > And it doesn't seem appropriate to let other groups treat it as cpu > stall because the rq->curr is also the victim rather than the one > causes stall (so it's different from rq->curr causing memstall and > observed as cpustall by others). IMHO, we don't care who causes stall, instead we care about what affects workload productivity. > >> >> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups >> as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system. >> > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-22 7:33 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-22 7:33 UTC (permalink / raw) To: Abel Wu, hannes, surenb, mingo, peterz, tj, corbet, akpm, rdunlap Cc: linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/22 15:14, Abel Wu wrote: > On 7/22/22 2:13 PM, Chengming Zhou Wrote: >> On 2022/7/22 11:30, Abel Wu wrote: >>> Hi Chengming, >>> >>> On 7/21/22 12:04 PM, Chengming Zhou Wrote: >>>> Now PSI already tracked workload pressure stall information for >>>> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >>>> obvious impact on some workload productivity, such as web service >>>> workload. >>>> >>>> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >>>> from update_rq_clock_task(), in which we can record that delta >>>> to CPU curr task's cgroups as PSI_IRQ_FULL status. >>> >>> The {soft,}irq affection should be equal to all the runnable tasks >>> on that cpu, not only rq->curr. Further I think irqstall is per-cpu >>> rather than per-cgroup. >> >> Although IRQ/SOFTIRQ is per-cpu, it's the rq->curr who own the CPU at the time >> and pay for it, meanwhile other groups would be thought as PSI_CPU_FULL. > > I don't think rq->curr pays for it if you mean consuming quota here. Yes, it makes rq->curr's groups look more productive than it actually is, which are clearly different from other groups. > And it doesn't seem appropriate to let other groups treat it as cpu > stall because the rq->curr is also the victim rather than the one > causes stall (so it's different from rq->curr causing memstall and > observed as cpustall by others). IMHO, we don't care who causes stall, instead we care about what affects workload productivity. > >> >> So I think it's reasonable to account this IRQ/SOFTIRQ delta to rq->curr's groups >> as PSI_IRQ_FULL pressure stall. And per-cpu IRQ stall can also get from psi_system. >> > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou @ 2022-07-27 16:07 ` Peter Zijlstra 2022-07-25 18:26 ` Johannes Weiner 1 sibling, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2022-07-27 16:07 UTC (permalink / raw) To: Chengming Zhou Cc: hannes-druUgvl0LCNAfugRpC6u6w, surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA, tj-DgEjT+Ai2ygdnm+yROfE0A, corbet-T1hC0tSOHrs, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, songmuchun-EC8Uxl6Npydl57MIdRCFDg, cgroups-u79uwXL29TY76Z2rM5mHXA On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote: > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index c124f7d186d0..195f123b1cd1 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -47,7 +47,8 @@ enum psi_res { > PSI_IO, > PSI_MEM, > PSI_CPU, > - NR_PSI_RESOURCES = 3, > + PSI_IRQ, > + NR_PSI_RESOURCES = 4, > }; > > /* > @@ -63,9 +64,11 @@ enum psi_states { > PSI_MEM_FULL, > PSI_CPU_SOME, > PSI_CPU_FULL, > + PSI_IRQ_SOME, > + PSI_IRQ_FULL, > /* Only per-CPU, to weigh the CPU in the global average: */ > PSI_NONIDLE, > - NR_PSI_STATES = 7, > + NR_PSI_STATES = 9, > }; > > enum psi_aggregators { $ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o struct psi_group_cpu { /* typedef seqcount_t */ struct seqcount { unsigned int sequence; /* 0 4 */ } seq __attribute__((__aligned__(64))); /* 0 4 */ unsigned int tasks[5]; /* 4 20 */ /* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */ /* typedef u32 -> __u32 */ unsigned int times[7]; /* 28 28 */ /* typedef u64 -> __u64 */ long long unsigned int state_start; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* typedef u32 -> __u32 */ unsigned int times_prev[2][7] __attribute__((__aligned__(64))); /* 64 56 */ /* size: 128, cachelines: 2, members: 6 */ /* padding: 8 */ /* forced alignments: 2 */ } __attribute__((__aligned__(64))); $ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o struct psi_group_cpu { /* typedef seqcount_t */ struct seqcount { unsigned int sequence; /* 0 4 */ } seq __attribute__((__aligned__(64))); /* 0 4 */ unsigned int tasks[5]; /* 4 20 */ /* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */ /* typedef u32 -> __u32 */ unsigned int times[9]; /* 28 36 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* typedef u64 -> __u64 */ long long unsigned int state_start; /* 64 8 */ /* XXX 56 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ /* typedef u32 -> __u32 */ unsigned int times_prev[2][9] __attribute__((__aligned__(64))); /* 128 72 */ /* size: 256, cachelines: 4, members: 6 */ /* sum members: 144, holes: 1, sum holes: 56 */ /* padding: 56 */ /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */ } __attribute__((__aligned__(64))); So yeah, I think not. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure @ 2022-07-27 16:07 ` Peter Zijlstra 0 siblings, 0 replies; 56+ messages in thread From: Peter Zijlstra @ 2022-07-27 16:07 UTC (permalink / raw) To: Chengming Zhou Cc: hannes, surenb, mingo, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote: > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index c124f7d186d0..195f123b1cd1 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -47,7 +47,8 @@ enum psi_res { > PSI_IO, > PSI_MEM, > PSI_CPU, > - NR_PSI_RESOURCES = 3, > + PSI_IRQ, > + NR_PSI_RESOURCES = 4, > }; > > /* > @@ -63,9 +64,11 @@ enum psi_states { > PSI_MEM_FULL, > PSI_CPU_SOME, > PSI_CPU_FULL, > + PSI_IRQ_SOME, > + PSI_IRQ_FULL, > /* Only per-CPU, to weigh the CPU in the global average: */ > PSI_NONIDLE, > - NR_PSI_STATES = 7, > + NR_PSI_STATES = 9, > }; > > enum psi_aggregators { $ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o struct psi_group_cpu { /* typedef seqcount_t */ struct seqcount { unsigned int sequence; /* 0 4 */ } seq __attribute__((__aligned__(64))); /* 0 4 */ unsigned int tasks[5]; /* 4 20 */ /* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */ /* typedef u32 -> __u32 */ unsigned int times[7]; /* 28 28 */ /* typedef u64 -> __u64 */ long long unsigned int state_start; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* typedef u32 -> __u32 */ unsigned int times_prev[2][7] __attribute__((__aligned__(64))); /* 64 56 */ /* size: 128, cachelines: 2, members: 6 */ /* padding: 8 */ /* forced alignments: 2 */ } __attribute__((__aligned__(64))); $ pahole -EC psi_group_cpu defconfig-build/kernel/sched/build_utility.o struct psi_group_cpu { /* typedef seqcount_t */ struct seqcount { unsigned int sequence; /* 0 4 */ } seq __attribute__((__aligned__(64))); /* 0 4 */ unsigned int tasks[5]; /* 4 20 */ /* typedef u32 -> __u32 */ unsigned int state_mask; /* 24 4 */ /* typedef u32 -> __u32 */ unsigned int times[9]; /* 28 36 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* typedef u64 -> __u64 */ long long unsigned int state_start; /* 64 8 */ /* XXX 56 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ /* typedef u32 -> __u32 */ unsigned int times_prev[2][9] __attribute__((__aligned__(64))); /* 128 72 */ /* size: 256, cachelines: 4, members: 6 */ /* sum members: 144, holes: 1, sum holes: 56 */ /* padding: 56 */ /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */ } __attribute__((__aligned__(64))); So yeah, I think not. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou [not found] ` <20220721040439.2651-10-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org> @ 2022-07-25 18:26 ` Johannes Weiner 2022-07-26 13:55 ` [External] " Chengming Zhou 2022-07-27 11:28 ` Chengming Zhou 1 sibling, 2 replies; 56+ messages in thread From: Johannes Weiner @ 2022-07-25 18:26 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote: > Now PSI already tracked workload pressure stall information for > CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have > obvious impact on some workload productivity, such as web service > workload. > > When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time > from update_rq_clock_task(), in which we can record that delta > to CPU curr task's cgroups as PSI_IRQ_FULL status. > > Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in > the current task on the CPU, make nothing productive could run > even if it were runnable, so we only use PSI_IRQ_FULL. That sounds reasonable. > For performance impact consideration, this is enabled by default when > CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline > parameter "psi_irq=". If there isn't a concrete usecase already, let's not add another commandline parameter for now. > @@ -63,9 +64,11 @@ enum psi_states { > PSI_MEM_FULL, > PSI_CPU_SOME, > PSI_CPU_FULL, > + PSI_IRQ_SOME, > + PSI_IRQ_FULL, > /* Only per-CPU, to weigh the CPU in the global average: */ > PSI_NONIDLE, > - NR_PSI_STATES = 7, > + NR_PSI_STATES = 9, > }; Unfortunately, this grows the psi state touched by the scheduler into a second cacheline. :( Please reclaim space first. I think we can remove the NR_CPU task count, which frees up one u32. Something like the below diff should work (untested!) And you should be able to remove PSI_IRQ_SOME, since it's not used anyway. Then we'd be good to go. diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index c7fe7c089718..31dc76e2d8ea 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -15,13 +15,6 @@ enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, - /* - * This can't have values other than 0 or 1 and could be - * implemented as a bit flag. But for now we still have room - * in the first cacheline of psi_group_cpu, and this way we - * don't have to special case any state tracking for it. - */ - NR_ONCPU, /* * For IO and CPU stalls the presence of running/oncpu tasks * in the domain means a partial rather than a full stall. @@ -39,9 +32,11 @@ enum psi_task_count { #define TSK_IOWAIT (1 << NR_IOWAIT) #define TSK_MEMSTALL (1 << NR_MEMSTALL) #define TSK_RUNNING (1 << NR_RUNNING) -#define TSK_ONCPU (1 << NR_ONCPU) #define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING) +/* Only one task can be scheduled, no corresponding task count */ +#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS) + /* Resources that workloads could be stalled on */ enum psi_res { PSI_IO, diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index a4fa3aadfcba..232e1dbfad46 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -215,7 +215,7 @@ void __init psi_init(void) group_init(&psi_system); } -static bool test_state(unsigned int *tasks, enum psi_states state) +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu) { switch (state) { case PSI_IO_SOME: @@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state) return unlikely(tasks[NR_MEMSTALL] && tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); case PSI_CPU_SOME: - return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]); + return unlikely(tasks[NR_RUNNING] > oncpu); case PSI_CPU_FULL: - return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]); + return unlikely(tasks[NR_RUNNING] && !oncpu); case PSI_NONIDLE: return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || tasks[NR_RUNNING]; @@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu, bool wake_clock) { struct psi_group_cpu *groupc; - u32 state_mask = 0; unsigned int t, m; enum psi_states s; + u32 state_mask; groupc = per_cpu_ptr(group->pcpu, cpu); @@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu, record_times(groupc, now); + /* + * Start with TSK_ONCPU, which doesn't have a corresponding + * task count - it's just a boolean flag directly encoded in + * the state mask. Clear, set, or carry the current state if + * no changes are requested. + */ + if (clear & TSK_ONCPU) { + state_mask = 0; + clear &= ~TSK_ONCPU; + } else if (set & TSK_ONCPU) { + state_mask = TSK_ONCPU; + set &= ~TSK_ONCPU; + } else { + state_mask = groupc->state_mask & TSK_ONCPU; + } + + /* + * The rest of the state mask is calculated based on the task + * counts. Update those first, then construct the mask. + */ for (t = 0, m = clear; m; m &= ~(1 << t), t++) { if (!(m & (1 << t))) continue; @@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++; - /* Calculate state mask representing active states */ for (s = 0; s < NR_PSI_STATES; s++) { - if (test_state(groupc->tasks, s)) + if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU)) state_mask |= (1 << s); } @@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu, * task in a cgroup is in_memstall, the corresponding groupc * on that cpu is in PSI_MEM_FULL state. */ - if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)) + if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall)) state_mask |= (1 << PSI_MEM_FULL); groupc->state_mask = state_mask; @@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, iter = NULL; while ((group = iterate_groups(next, &iter))) { if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + (per_cpu_ptr(group->pcpu, cpu)->state_mask & + TSK_ONCPU)) { common = group; break; } ^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [External] Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-25 18:26 ` Johannes Weiner @ 2022-07-26 13:55 ` Chengming Zhou 2022-07-27 11:28 ` Chengming Zhou 1 sibling, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-26 13:55 UTC (permalink / raw) To: Johannes Weiner Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/26 02:26, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote: >> Now PSI already tracked workload pressure stall information for >> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >> obvious impact on some workload productivity, such as web service >> workload. >> >> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >> from update_rq_clock_task(), in which we can record that delta >> to CPU curr task's cgroups as PSI_IRQ_FULL status. >> >> Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in >> the current task on the CPU, make nothing productive could run >> even if it were runnable, so we only use PSI_IRQ_FULL. > > That sounds reasonable. > >> For performance impact consideration, this is enabled by default when >> CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline >> parameter "psi_irq=". > > If there isn't a concrete usecase already, let's not add another > commandline parameter for now. Ok, will remove it. > >> @@ -63,9 +64,11 @@ enum psi_states { >> PSI_MEM_FULL, >> PSI_CPU_SOME, >> PSI_CPU_FULL, >> + PSI_IRQ_SOME, >> + PSI_IRQ_FULL, >> /* Only per-CPU, to weigh the CPU in the global average: */ >> PSI_NONIDLE, >> - NR_PSI_STATES = 7, >> + NR_PSI_STATES = 9, >> }; > > Unfortunately, this grows the psi state touched by the scheduler into > a second cacheline. :( Please reclaim space first. Thank you for pointing this out! > > I think we can remove the NR_CPU task count, which frees up one > u32. Something like the below diff should work (untested!) > > And you should be able to remove PSI_IRQ_SOME, since it's not used > anyway. Then we'd be good to go. Very good solution for this, I will test it later. Thanks! > > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index c7fe7c089718..31dc76e2d8ea 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -15,13 +15,6 @@ enum psi_task_count { > NR_IOWAIT, > NR_MEMSTALL, > NR_RUNNING, > - /* > - * This can't have values other than 0 or 1 and could be > - * implemented as a bit flag. But for now we still have room > - * in the first cacheline of psi_group_cpu, and this way we > - * don't have to special case any state tracking for it. > - */ > - NR_ONCPU, > /* > * For IO and CPU stalls the presence of running/oncpu tasks > * in the domain means a partial rather than a full stall. > @@ -39,9 +32,11 @@ enum psi_task_count { > #define TSK_IOWAIT (1 << NR_IOWAIT) > #define TSK_MEMSTALL (1 << NR_MEMSTALL) > #define TSK_RUNNING (1 << NR_RUNNING) > -#define TSK_ONCPU (1 << NR_ONCPU) > #define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING) > > +/* Only one task can be scheduled, no corresponding task count */ > +#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS) > + > /* Resources that workloads could be stalled on */ > enum psi_res { > PSI_IO, > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index a4fa3aadfcba..232e1dbfad46 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -215,7 +215,7 @@ void __init psi_init(void) > group_init(&psi_system); > } > > -static bool test_state(unsigned int *tasks, enum psi_states state) > +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu) > { > switch (state) { > case PSI_IO_SOME: > @@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state) > return unlikely(tasks[NR_MEMSTALL] && > tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); > case PSI_CPU_SOME: > - return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]); > + return unlikely(tasks[NR_RUNNING] > oncpu); > case PSI_CPU_FULL: > - return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]); > + return unlikely(tasks[NR_RUNNING] && !oncpu); > case PSI_NONIDLE: > return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || > tasks[NR_RUNNING]; > @@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu, > bool wake_clock) > { > struct psi_group_cpu *groupc; > - u32 state_mask = 0; > unsigned int t, m; > enum psi_states s; > + u32 state_mask; > > groupc = per_cpu_ptr(group->pcpu, cpu); > > @@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu, > > record_times(groupc, now); > > + /* > + * Start with TSK_ONCPU, which doesn't have a corresponding > + * task count - it's just a boolean flag directly encoded in > + * the state mask. Clear, set, or carry the current state if > + * no changes are requested. > + */ > + if (clear & TSK_ONCPU) { > + state_mask = 0; > + clear &= ~TSK_ONCPU; > + } else if (set & TSK_ONCPU) { > + state_mask = TSK_ONCPU; > + set &= ~TSK_ONCPU; > + } else { > + state_mask = groupc->state_mask & TSK_ONCPU; > + } > + > + /* > + * The rest of the state mask is calculated based on the task > + * counts. Update those first, then construct the mask. > + */ > for (t = 0, m = clear; m; m &= ~(1 << t), t++) { > if (!(m & (1 << t))) > continue; > @@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu, > if (set & (1 << t)) > groupc->tasks[t]++; > > - /* Calculate state mask representing active states */ > for (s = 0; s < NR_PSI_STATES; s++) { > - if (test_state(groupc->tasks, s)) > + if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU)) > state_mask |= (1 << s); > } > > @@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu, > * task in a cgroup is in_memstall, the corresponding groupc > * on that cpu is in PSI_MEM_FULL state. > */ > - if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)) > + if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall)) > state_mask |= (1 << PSI_MEM_FULL); > > groupc->state_mask = state_mask; > @@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > iter = NULL; > while ((group = iterate_groups(next, &iter))) { > if (identical_state && > - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { > + (per_cpu_ptr(group->pcpu, cpu)->state_mask & > + TSK_ONCPU)) { > common = group; > break; > } ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-25 18:26 ` Johannes Weiner 2022-07-26 13:55 ` [External] " Chengming Zhou @ 2022-07-27 11:28 ` Chengming Zhou 2022-07-27 13:00 ` Johannes Weiner 1 sibling, 1 reply; 56+ messages in thread From: Chengming Zhou @ 2022-07-27 11:28 UTC (permalink / raw) To: Johannes Weiner Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/26 02:26, Johannes Weiner wrote: > On Thu, Jul 21, 2022 at 12:04:39PM +0800, Chengming Zhou wrote: >> Now PSI already tracked workload pressure stall information for >> CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have >> obvious impact on some workload productivity, such as web service >> workload. >> >> When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time >> from update_rq_clock_task(), in which we can record that delta >> to CPU curr task's cgroups as PSI_IRQ_FULL status. >> >> Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in >> the current task on the CPU, make nothing productive could run >> even if it were runnable, so we only use PSI_IRQ_FULL. > > That sounds reasonable. > >> For performance impact consideration, this is enabled by default when >> CONFIG_IRQ_TIME_ACCOUNTING, but can be disabled by kernel cmdline >> parameter "psi_irq=". > > If there isn't a concrete usecase already, let's not add another > commandline parameter for now. > >> @@ -63,9 +64,11 @@ enum psi_states { >> PSI_MEM_FULL, >> PSI_CPU_SOME, >> PSI_CPU_FULL, >> + PSI_IRQ_SOME, >> + PSI_IRQ_FULL, >> /* Only per-CPU, to weigh the CPU in the global average: */ >> PSI_NONIDLE, >> - NR_PSI_STATES = 7, >> + NR_PSI_STATES = 9, >> }; > > Unfortunately, this grows the psi state touched by the scheduler into > a second cacheline. :( Please reclaim space first. > > I think we can remove the NR_CPU task count, which frees up one > u32. Something like the below diff should work (untested!) Hi, I tested ok, would you mind if I put this patch in this series? Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting We put all fields updated by the scheduler in the first cacheline of struct psi_group_cpu for performance. Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure, we need to reclaim space first. This patch remove NR_ONCPU task accounting in struct psi_group_cpu, use TSK_ONCPU in state_mask to track instead. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> > > And you should be able to remove PSI_IRQ_SOME, since it's not used > anyway. Then we'd be good to go. > > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > index c7fe7c089718..31dc76e2d8ea 100644 > --- a/include/linux/psi_types.h > +++ b/include/linux/psi_types.h > @@ -15,13 +15,6 @@ enum psi_task_count { > NR_IOWAIT, > NR_MEMSTALL, > NR_RUNNING, > - /* > - * This can't have values other than 0 or 1 and could be > - * implemented as a bit flag. But for now we still have room > - * in the first cacheline of psi_group_cpu, and this way we > - * don't have to special case any state tracking for it. > - */ > - NR_ONCPU, > /* > * For IO and CPU stalls the presence of running/oncpu tasks > * in the domain means a partial rather than a full stall. > @@ -39,9 +32,11 @@ enum psi_task_count { > #define TSK_IOWAIT (1 << NR_IOWAIT) > #define TSK_MEMSTALL (1 << NR_MEMSTALL) > #define TSK_RUNNING (1 << NR_RUNNING) > -#define TSK_ONCPU (1 << NR_ONCPU) > #define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING) > > +/* Only one task can be scheduled, no corresponding task count */ > +#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS) > + > /* Resources that workloads could be stalled on */ > enum psi_res { > PSI_IO, > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index a4fa3aadfcba..232e1dbfad46 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -215,7 +215,7 @@ void __init psi_init(void) > group_init(&psi_system); > } > > -static bool test_state(unsigned int *tasks, enum psi_states state) > +static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu) > { > switch (state) { > case PSI_IO_SOME: > @@ -228,9 +228,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state) > return unlikely(tasks[NR_MEMSTALL] && > tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); > case PSI_CPU_SOME: > - return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]); > + return unlikely(tasks[NR_RUNNING] > oncpu); > case PSI_CPU_FULL: > - return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]); > + return unlikely(tasks[NR_RUNNING] && !oncpu); > case PSI_NONIDLE: > return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || > tasks[NR_RUNNING]; > @@ -692,9 +692,9 @@ static void psi_group_change(struct psi_group *group, int cpu, > bool wake_clock) > { > struct psi_group_cpu *groupc; > - u32 state_mask = 0; > unsigned int t, m; > enum psi_states s; > + u32 state_mask; > > groupc = per_cpu_ptr(group->pcpu, cpu); > > @@ -710,6 +710,26 @@ static void psi_group_change(struct psi_group *group, int cpu, > > record_times(groupc, now); > > + /* > + * Start with TSK_ONCPU, which doesn't have a corresponding > + * task count - it's just a boolean flag directly encoded in > + * the state mask. Clear, set, or carry the current state if > + * no changes are requested. > + */ > + if (clear & TSK_ONCPU) { > + state_mask = 0; > + clear &= ~TSK_ONCPU; > + } else if (set & TSK_ONCPU) { > + state_mask = TSK_ONCPU; > + set &= ~TSK_ONCPU; > + } else { > + state_mask = groupc->state_mask & TSK_ONCPU; > + } > + > + /* > + * The rest of the state mask is calculated based on the task > + * counts. Update those first, then construct the mask. > + */ > for (t = 0, m = clear; m; m &= ~(1 << t), t++) { > if (!(m & (1 << t))) > continue; > @@ -729,9 +749,8 @@ static void psi_group_change(struct psi_group *group, int cpu, > if (set & (1 << t)) > groupc->tasks[t]++; > > - /* Calculate state mask representing active states */ > for (s = 0; s < NR_PSI_STATES; s++) { > - if (test_state(groupc->tasks, s)) > + if (test_state(groupc->tasks, s, state_mask & TSK_ONCPU)) > state_mask |= (1 << s); > } > > @@ -743,7 +762,7 @@ static void psi_group_change(struct psi_group *group, int cpu, > * task in a cgroup is in_memstall, the corresponding groupc > * on that cpu is in PSI_MEM_FULL state. > */ > - if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)) > + if (unlikely((state_mask & TSK_ONCPU) && cpu_curr(cpu)->in_memstall)) > state_mask |= (1 << PSI_MEM_FULL); > > groupc->state_mask = state_mask; > @@ -847,7 +866,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > iter = NULL; > while ((group = iterate_groups(next, &iter))) { > if (identical_state && > - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { > + (per_cpu_ptr(group->pcpu, cpu)->state_mask & > + TSK_ONCPU)) { > common = group; > break; > } ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-27 11:28 ` Chengming Zhou @ 2022-07-27 13:00 ` Johannes Weiner 2022-07-27 15:09 ` Chengming Zhou 0 siblings, 1 reply; 56+ messages in thread From: Johannes Weiner @ 2022-07-27 13:00 UTC (permalink / raw) To: Chengming Zhou Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote: > On 2022/7/26 02:26, Johannes Weiner wrote: > > I think we can remove the NR_CPU task count, which frees up one > > u32. Something like the below diff should work (untested!) > > Hi, I tested ok, would you mind if I put this patch in this series? > > Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting > > We put all fields updated by the scheduler in the first cacheline of > struct psi_group_cpu for performance. > > Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure, > we need to reclaim space first. This patch remove NR_ONCPU task accounting > in struct psi_group_cpu, use TSK_ONCPU in state_mask to track instead. Thanks for testing it, that sounds good. > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> > Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Since you're handling the patch, you need to add your own Signed-off-by: as well. And keep From: Johannes (git commit --author). Thanks! ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure 2022-07-27 13:00 ` Johannes Weiner @ 2022-07-27 15:09 ` Chengming Zhou 0 siblings, 0 replies; 56+ messages in thread From: Chengming Zhou @ 2022-07-27 15:09 UTC (permalink / raw) To: Johannes Weiner Cc: surenb, mingo, peterz, tj, corbet, akpm, rdunlap, linux-doc, linux-kernel, songmuchun, cgroups On 2022/7/27 21:00, Johannes Weiner wrote: > On Wed, Jul 27, 2022 at 07:28:37PM +0800, Chengming Zhou wrote: >> On 2022/7/26 02:26, Johannes Weiner wrote: >>> I think we can remove the NR_CPU task count, which frees up one >>> u32. Something like the below diff should work (untested!) >> >> Hi, I tested ok, would you mind if I put this patch in this series? >> >> Subject: [PATCH] sched/psi: remove NR_ONCPU task accounting >> >> We put all fields updated by the scheduler in the first cacheline of >> struct psi_group_cpu for performance. >> >> Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure, >> we need to reclaim space first. This patch remove NR_ONCPU task accounting >> in struct psi_group_cpu, use TSK_ONCPU in state_mask to track instead. > > Thanks for testing it, that sounds good. > >> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> >> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> >> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> > > Since you're handling the patch, you need to add your own > Signed-off-by: as well. And keep From: Johannes (git commit --author). Got it. Thanks! ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2022-08-04 16:56 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-07-21 4:04 [PATCH 0/9] sched/psi: some optimization and extension Chengming Zhou
[not found] ` <20220721040439.2651-1-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-21 4:04 ` [PATCH 1/9] sched/psi: fix periodic aggregation shut off Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
2022-07-25 15:34 ` Johannes Weiner
[not found] ` <20220721040439.2651-2-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-25 15:39 ` Johannes Weiner
2022-07-25 15:39 ` Johannes Weiner
2022-07-26 13:28 ` Chengming Zhou
2022-07-21 4:04 ` [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
2022-07-21 4:04 ` [PATCH 3/9] sched/psi: move private helpers to sched/stats.h Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
2022-07-25 16:39 ` Johannes Weiner
2022-07-21 4:04 ` [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
2022-07-21 4:04 ` [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
2022-07-21 4:04 ` [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Chengming Zhou
2022-07-21 4:04 ` Chengming Zhou
[not found] ` <20220721040439.2651-9-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-25 16:52 ` Johannes Weiner
2022-07-25 16:52 ` Johannes Weiner
[not found] ` <Yt7KQc0nnOypB2b2-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-07-26 13:38 ` [External] " Chengming Zhou
2022-07-26 13:38 ` Chengming Zhou
2022-07-26 17:54 ` Tejun Heo
[not found] ` <YuAqWprKd6NsWs7C-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2022-08-03 12:17 ` Chengming Zhou
2022-08-03 12:17 ` Chengming Zhou
2022-08-03 17:58 ` Tejun Heo
2022-08-03 19:22 ` Johannes Weiner
2022-08-03 19:48 ` Tejun Heo
2022-08-04 13:51 ` Chengming Zhou
[not found] ` <f8444db4-3235-d108-698a-6772e03a6b67-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-08-04 16:56 ` Johannes Weiner
2022-08-04 16:56 ` Johannes Weiner
2022-08-04 2:02 ` Chengming Zhou
2022-07-21 4:04 ` [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Chengming Zhou
[not found] ` <20220721040439.2651-6-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-25 16:41 ` Johannes Weiner
2022-07-25 16:41 ` Johannes Weiner
2022-07-21 4:04 ` [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Chengming Zhou
2022-07-25 16:47 ` Johannes Weiner
2022-07-21 4:04 ` [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Chengming Zhou
[not found] ` <20220721040439.2651-10-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-21 10:00 ` kernel test robot
2022-07-21 10:00 ` kernel test robot
2022-07-21 22:10 ` kernel test robot
2022-07-21 22:10 ` kernel test robot
2022-07-22 3:30 ` Abel Wu
2022-07-22 3:30 ` Abel Wu
[not found] ` <65d9f79b-be9b-e21e-0624-5c9f2cc0c0b2-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-22 6:13 ` Chengming Zhou
2022-07-22 6:13 ` Chengming Zhou
2022-07-22 7:14 ` Abel Wu
[not found] ` <5e5d41e2-5f89-8c52-11e5-0c55c5595a88-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
2022-07-22 7:33 ` Chengming Zhou
2022-07-22 7:33 ` Chengming Zhou
2022-07-27 16:07 ` Peter Zijlstra
2022-07-27 16:07 ` Peter Zijlstra
2022-07-25 18:26 ` Johannes Weiner
2022-07-26 13:55 ` [External] " Chengming Zhou
2022-07-27 11:28 ` Chengming Zhou
2022-07-27 13:00 ` Johannes Weiner
2022-07-27 15:09 ` Chengming Zhou
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.