* [PATCH v4 01/10] sched/psi: fix periodic aggregation shut off
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes, tj, mkoutny, surenb
Cc: mingo, peterz, gregkh, corbet, cgroups, linux-doc, linux-kernel,
songmuchun, Chengming Zhou
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.
Previously, psi_dequeue() called psi_task_change() when a task
went to sleep, so this check was placed in psi_task_change().
But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), so a sleeping
task no longer goes through psi_task_change().
So this patch moves the check to psi_task_switch().
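The check described above can be sketched in userspace C. This is only an illustration under stated assumptions: the `sketch_` names, the flag value, and the struct layout are hypothetical stand-ins, not the kernel's `task_struct`, `PF_WQ_WORKER`, or `wq_worker_last_func()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PF_WQ_WORKER; the value is arbitrary. */
#define SKETCH_PF_WQ_WORKER 0x1u

typedef void (*sketch_work_fn)(void);

/* Hypothetical, heavily reduced stand-in for struct task_struct. */
struct sketch_task {
	unsigned int flags;       /* SKETCH_PF_WQ_WORKER for workqueue workers */
	sketch_work_fn last_func; /* last work function the worker executed */
};

/* Stand-in for psi_avgs_work(), the periodic aggregation work. */
void sketch_avgs_work(void) { }

/* Stand-in for any unrelated work item. */
void sketch_other_work(void) { }

/*
 * Decide whether a task that is going to sleep may re-arm the periodic
 * aggregation. The aggregation worker itself must not, otherwise its own
 * sleep would immediately schedule the next aggregation run.
 */
bool sketch_wake_clock_on_sleep(const struct sketch_task *prev)
{
	assert(prev != NULL);

	if ((prev->flags & SKETCH_PF_WQ_WORKER) &&
	    prev->last_func == sketch_avgs_work)
		return false;

	return true;
}
```

In the patch itself the result of this test is carried in `wake_clock` and passed to `psi_group_change()` on the sleep path of `psi_task_switch()`.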
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/psi.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0..39463dcc16bb 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,7 +796,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- bool wake_clock = true;
void *iter = NULL;
u64 now;
@@ -806,19 +805,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
now = cpu_clock(cpu);
- /*
- * Periodic aggregation shuts off if there is a period of no
- * task changes, so we wake it back up if necessary. However,
- * don't do this if the task change is the aggregation worker
- * itself going to sleep, or we'll ping-pong forever.
- */
- if (unlikely((clear & TSK_RUNNING) &&
- (task->flags & PF_WQ_WORKER) &&
- wq_worker_last_func(task) == psi_avgs_work))
- wake_clock = false;
while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, true);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -854,6 +843,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
+ bool wake_clock = true;
/*
* When we're going to sleep, psi_dequeue() lets us
@@ -867,13 +857,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
clear |= TSK_MEMSTALL_RUNNING;
if (prev->in_iowait)
set |= TSK_IOWAIT;
+
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((prev->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(prev) == psi_avgs_work))
+ wake_clock = false;
}
psi_flags_change(prev, clear, set);
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -882,7 +882,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (sleep) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
}
--
2.37.2
* [tip: sched/psi] sched/psi: Fix periodic aggregation shut off
From: tip-bot2 for Chengming Zhou @ 2022-09-09 14:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chengming Zhou, Peter Zijlstra (Intel), Johannes Weiner, x86,
linux-kernel
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: c530a3c716b963625e43aa915e0de6b4d1ce8ad9
Gitweb: https://git.kernel.org/tip/c530a3c716b963625e43aa915e0de6b4d1ce8ad9
Author: Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate: Fri, 26 Aug 2022 00:41:02 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Sep 2022 11:08:30 +02:00
sched/psi: Fix periodic aggregation shut off
We don't want to wake periodic aggregation work back up if the
task change is the aggregation worker itself going to sleep, or
we'll ping-pong forever.
Previously, psi_dequeue() called psi_task_change() when a task
went to sleep, so this check was placed in psi_task_change().
But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), so a sleeping
task no longer goes through psi_task_change().
So this patch moves the check to psi_task_switch().
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-2-zhouchengming@bytedance.com
---
kernel/sched/psi.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4f..39463dc 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,7 +796,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- bool wake_clock = true;
void *iter = NULL;
u64 now;
@@ -806,19 +805,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
now = cpu_clock(cpu);
- /*
- * Periodic aggregation shuts off if there is a period of no
- * task changes, so we wake it back up if necessary. However,
- * don't do this if the task change is the aggregation worker
- * itself going to sleep, or we'll ping-pong forever.
- */
- if (unlikely((clear & TSK_RUNNING) &&
- (task->flags & PF_WQ_WORKER) &&
- wq_worker_last_func(task) == psi_avgs_work))
- wake_clock = false;
while ((group = iterate_groups(task, &iter)))
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, true);
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -854,6 +843,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
+ bool wake_clock = true;
/*
* When we're going to sleep, psi_dequeue() lets us
@@ -867,13 +857,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
clear |= TSK_MEMSTALL_RUNNING;
if (prev->in_iowait)
set |= TSK_IOWAIT;
+
+ /*
+ * Periodic aggregation shuts off if there is a period of no
+ * task changes, so we wake it back up if necessary. However,
+ * don't do this if the task change is the aggregation worker
+ * itself going to sleep, or we'll ping-pong forever.
+ */
+ if (unlikely((prev->flags & PF_WQ_WORKER) &&
+ wq_worker_last_func(prev) == psi_avgs_work))
+ wake_clock = false;
}
psi_flags_change(prev, clear, set);
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
/*
* TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -882,7 +882,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if (sleep) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
}
* [PATCH v4 03/10] sched/psi: save percpu memory when !psi_cgroups_enabled
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes, tj, mkoutny, surenb
Cc: mingo, peterz, gregkh, corbet, cgroups, linux-doc, linux-kernel,
songmuchun, Chengming Zhou
We don't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it.
There is also no need to migrate task PSI stats between cgroups in
cgroup_move_task().
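The gating described above can be sketched as follows. This is a userspace illustration only: a plain `bool` stands in for the `psi_cgroups_enabled` static key, `calloc()` stands in for the kernel allocators, and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct psi_group and struct cgroup. */
struct sketch_psi_group {
	unsigned long long total;
};

struct sketch_cgroup {
	struct sketch_psi_group *psi;
};

/* Plain flag standing in for the psi_cgroups_enabled static key. */
bool sketch_psi_cgroups_enabled = true;

/* Allocate per-cgroup PSI state only when cgroup PSI is in use. */
int sketch_psi_cgroup_alloc(struct sketch_cgroup *cg)
{
	assert(cg != NULL);

	if (!sketch_psi_cgroups_enabled)
		return 0; /* nothing to allocate, and not an error */

	cg->psi = calloc(1, sizeof(*cg->psi));
	return cg->psi ? 0 : -1;
}

/* Mirror the same test on free, so a never-allocated group is skipped. */
void sketch_psi_cgroup_free(struct sketch_cgroup *cg)
{
	assert(cg != NULL);

	if (!sketch_psi_cgroups_enabled)
		return;

	free(cg->psi);
	cg->psi = NULL;
}
```

The sketch assumes the flag does not change between the alloc and free calls for a given cgroup. In the patch the key is only switched off in `psi_init()`, which is marked `__init`.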
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 39463dcc16bb..77d53c03a76f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,7 @@ void __init psi_init(void)
{
if (!psi_enable) {
static_branch_enable(&psi_disabled);
+ static_branch_disable(&psi_cgroups_enabled);
return;
}
@@ -950,7 +951,7 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
@@ -968,7 +969,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return;
cancel_delayed_work_sync(&cgroup->psi->avgs_work);
@@ -996,7 +997,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
struct rq_flags rf;
struct rq *rq;
- if (static_branch_likely(&psi_disabled)) {
+ if (!static_branch_likely(&psi_cgroups_enabled)) {
/*
* Lame to do this here, but the scheduler cannot be locked
* from the outside, so we move cgroups from inside sched/.
--
2.37.2
* [tip: sched/psi] sched/psi: Save percpu memory when !psi_cgroups_enabled
From: tip-bot2 for Chengming Zhou @ 2022-09-09 14:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chengming Zhou, Peter Zijlstra (Intel), Johannes Weiner, x86,
linux-kernel
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: e2ad8ab04c5cdfc8dc2f382c45d248ab01dee991
Gitweb: https://git.kernel.org/tip/e2ad8ab04c5cdfc8dc2f382c45d248ab01dee991
Author: Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate: Fri, 26 Aug 2022 00:41:04 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Sep 2022 11:08:31 +02:00
sched/psi: Save percpu memory when !psi_cgroups_enabled
We don't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it.
There is also no need to migrate task PSI stats between cgroups in
cgroup_move_task().
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-4-zhouchengming@bytedance.com
---
kernel/sched/psi.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 39463dc..77d53c0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,7 @@ void __init psi_init(void)
{
if (!psi_enable) {
static_branch_enable(&psi_disabled);
+ static_branch_disable(&psi_cgroups_enabled);
return;
}
@@ -950,7 +951,7 @@ void psi_memstall_leave(unsigned long *flags)
#ifdef CONFIG_CGROUPS
int psi_cgroup_alloc(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return 0;
cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
@@ -968,7 +969,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup)
{
- if (static_branch_likely(&psi_disabled))
+ if (!static_branch_likely(&psi_cgroups_enabled))
return;
cancel_delayed_work_sync(&cgroup->psi->avgs_work);
@@ -996,7 +997,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
struct rq_flags rf;
struct rq *rq;
- if (static_branch_likely(&psi_disabled)) {
+ if (!static_branch_likely(&psi_cgroups_enabled)) {
/*
* Lame to do this here, but the scheduler cannot be locked
* from the outside, so we move cgroups from inside sched/.
* [PATCH v4 07/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes, tj, mkoutny, surenb
Cc: mingo, peterz, gregkh, corbet, cgroups, linux-doc, linux-kernel,
songmuchun, Chengming Zhou
PSI already tracks workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
obvious impact on the productivity of some workloads, such as web
services.
With CONFIG_IRQ_TIME_ACCOUNTING enabled, we can get the IRQ/SOFTIRQ
delta time from update_rq_clock_task() and record that delta against
the cgroups of the CPU's current task as PSI_IRQ_FULL time.
Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ always runs in the
context of the current task on the CPU, so nothing productive can run
there even if it is runnable. We therefore only use PSI_IRQ_FULL.
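The accounting idea above can be sketched in userspace C. This is a simplified illustration, not the patch's implementation: the real `psi_account_irqtime()` updates per-CPU buckets under a seqcount and may schedule poll work, while this sketch keeps one counter per group. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a psi_group linked to its parent cgroup. */
struct sketch_group {
	struct sketch_group *parent; /* NULL at the root */
	uint64_t irq_full_ns;        /* accumulated IRQ "full" stall time */
};

/* Hypothetical, reduced stand-in for struct task_struct. */
struct sketch_irq_task {
	int pid;                     /* 0 marks the idle task */
	struct sketch_group *group;  /* innermost group of the task */
};

/*
 * Charge an IRQ/SOFTIRQ time delta to every group the interrupted task
 * belongs to, from its own group up to the root. Time that interrupts
 * the idle task is not counted, since no workload was held up.
 */
void sketch_account_irqtime(const struct sketch_irq_task *task,
			    uint32_t delta_ns)
{
	struct sketch_group *g;

	assert(task != NULL);

	if (!task->pid)
		return;

	for (g = task->group; g; g = g->parent)
		g->irq_full_ns += delta_ns;
}
```

Only a "full" counter exists here because, as the message explains, the interrupted CPU can run nothing else for that task while the IRQ is being handled.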
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++
include/linux/psi_types.h | 10 +++-
kernel/cgroup/cgroup.c | 27 +++++++++
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 74 ++++++++++++++++++++++++-
kernel/sched/stats.h | 2 +
6 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77baf784..971c418bc778 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..40c28171cd91 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ,
+#endif
+ NR_PSI_RESOURCES,
};
/*
@@ -58,9 +61,12 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ_FULL,
+#endif
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES,
};
/* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2f79ddf9a85d..371131a8b6f8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3731,6 +3731,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5150,6 +5167,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61436b8e0337..178f9836ae96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4702a770e272..2545a78f82d8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -904,6 +904,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ int cpu = task_cpu(task);
+ void *iter = NULL;
+ struct psi_group *group;
+ struct psi_group_cpu *groupc;
+ u64 now;
+
+ if (!task->pid)
+ return;
+
+ now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter))) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+#endif
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1065,6 +1095,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
+ bool only_full = false;
int full;
u64 now;
@@ -1079,7 +1110,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ only_full = res == PSI_IRQ;
+#endif
+
+ for (full = 0; full < 2 - only_full; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1093,7 +1128,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
- full ? "full" : "some",
+ full || only_full ? "full" : "some",
LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1121,6 +1156,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+ return ERR_PTR(-EINVAL);
+#endif
+
if (state >= PSI_NONIDLE)
return ERR_PTR(-EINVAL);
@@ -1405,6 +1445,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1412,6 +1479,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..84a188913cc9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
--
2.37.2
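The psi_show() hunk in the patch above shortens the output loop so that an IRQ resource prints a single "full" line. A minimal userspace sketch of that loop bound follows. The function name and the reduced output format are assumptions made for illustration, and the averages and totals that the real function prints are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Write one line per reported state into buf and return the line count.
 * With only_full set, "2 - only_full" evaluates to 1, so the loop runs
 * once and the single line is labeled "full" instead of "some".
 */
int sketch_show_states(char *buf, size_t len, bool only_full)
{
	size_t used = 0;
	int lines = 0;
	int full;

	assert(buf != NULL);

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (full = 0; full < 2 - only_full; full++) {
		int n = snprintf(buf + used, len - used, "%s\n",
				 (full || only_full) ? "full" : "some");

		/* Stop on error or if the line did not fit. */
		if (n < 0 || (size_t)n >= len - used)
			break;

		used += (size_t)n;
		lines++;
	}

	return lines;
}
```

Calling it with `only_full` false yields the usual "some" and "full" pair, which is the behavior kept for IO, memory and CPU.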
* [PATCH v4 07/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes, tj, mkoutny, surenb
Cc: mingo, peterz, gregkh, corbet, cgroups, linux-doc, linux-kernel,
songmuchun, Chengming Zhou
PSI already tracks workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an
obvious impact on the productivity of some workloads, such as web
services.
With CONFIG_IRQ_TIME_ACCOUNTING enabled, we can get the IRQ/SOFTIRQ
delta time from update_rq_clock_task() and record that delta against
the cgroups of the CPU's current task as PSI_IRQ_FULL time.
Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ always runs in the
context of the current task on the CPU, so nothing productive can run
there even if it is runnable. We therefore only use PSI_IRQ_FULL.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++
include/linux/psi_types.h | 10 +++-
kernel/cgroup/cgroup.c | 27 +++++++++
kernel/sched/core.c | 1 +
kernel/sched/psi.c | 74 ++++++++++++++++++++++++-
kernel/sched/stats.h | 2 +
6 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77baf784..971c418bc778 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..40c28171cd91 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ,
+#endif
+ NR_PSI_RESOURCES,
};
/*
@@ -58,9 +61,12 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ_FULL,
+#endif
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES,
};
/* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2f79ddf9a85d..371131a8b6f8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3731,6 +3731,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5150,6 +5167,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61436b8e0337..178f9836ae96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4702a770e272..2545a78f82d8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -904,6 +904,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ int cpu = task_cpu(task);
+ void *iter = NULL;
+ struct psi_group *group;
+ struct psi_group_cpu *groupc;
+ u64 now;
+
+ if (!task->pid)
+ return;
+
+ now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter))) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+#endif
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1065,6 +1095,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
+ bool only_full = false;
int full;
u64 now;
@@ -1079,7 +1110,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ only_full = res == PSI_IRQ;
+#endif
+
+ for (full = 0; full < 2 - only_full; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1093,7 +1128,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
- full ? "full" : "some",
+ full || only_full ? "full" : "some",
LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1121,6 +1156,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+ return ERR_PTR(-EINVAL);
+#endif
+
if (state >= PSI_NONIDLE)
return ERR_PTR(-EINVAL);
@@ -1405,6 +1445,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1412,6 +1479,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..84a188913cc9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
--
2.37.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
[parent not found: <20220825164111.29534-8-zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH v4 07/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
2022-08-25 16:41 ` Chengming Zhou
@ 2022-08-25 17:17 ` Johannes Weiner
-1 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2022-08-25 17:17 UTC (permalink / raw)
To: Chengming Zhou
Cc: tj-DgEjT+Ai2ygdnm+yROfE0A, mkoutny-IBi9RG/b67k,
surenb-hpIqsD4AKlfQT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
peterz-wEGCiKHe2LqWVfeAwA7xHQ,
gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, corbet-T1hC0tSOHrs,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
songmuchun-EC8Uxl6Npydl57MIdRCFDg
On Fri, Aug 26, 2022 at 12:41:08AM +0800, Chengming Zhou wrote:
> PSI already tracks pressure stall information for CPU, memory and
> IO. Beyond these, IRQ/SOFTIRQ time can have an obvious impact on
> the productivity of some workloads, such as web services.
>
> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, update_rq_clock_task()
> gives us the IRQ/SOFTIRQ delta time, which we record against the
> cgroups of the CPU's current task as PSI_IRQ_FULL.
>
> Note that there is no PSI_IRQ_SOME: IRQ/SOFTIRQ always runs in the
> context of the current task on the CPU, so nothing productive can
> run in the meantime even if it were runnable. Hence only
> PSI_IRQ_FULL is used.
>
> Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
^ permalink raw reply [flat|nested] 38+ messages in thread
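The "only PSI_IRQ_FULL" rule in the commit message also shows up in the trigger validation added to psi_trigger_create(). The following is a minimal user-space sketch of that index arithmetic, not the kernel code itself. It assumes the enum layout from the patch with CONFIG_IRQ_TIME_ACCOUNTING enabled, and it assumes (this part is not visible in the quoted hunk) that a "some" trigger starts from PSI_IO_SOME + res * 2 and a "full" trigger from PSI_IO_FULL + res * 2. trigger_state() is a hypothetical helper name.

```c
#include <stdbool.h>

/* Mirrors enum psi_res from the patch, CONFIG_IRQ_TIME_ACCOUNTING assumed on. */
enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, PSI_IRQ, NR_PSI_RESOURCES };

/* Mirrors enum psi_states from the patch: IRQ has a "full" slot only. */
enum psi_states {
	PSI_IO_SOME, PSI_IO_FULL,
	PSI_MEM_SOME, PSI_MEM_FULL,
	PSI_CPU_SOME, PSI_CPU_FULL,
	PSI_IRQ_FULL,
	PSI_NONIDLE,
	NR_PSI_STATES,
};

/* Hypothetical helper: returns the state index, or -1 where the kernel
 * would return ERR_PTR(-EINVAL). */
static int trigger_state(enum psi_res res, bool full)
{
	/* Assumed base mapping: two states (some, full) per resource. */
	int state = (full ? PSI_IO_FULL : PSI_IO_SOME) + (int)res * 2;

	/* IRQ occupies a single slot, so the computed index is one too high;
	 * after the decrement only "full" lands on PSI_IRQ_FULL. */
	if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
		return -1;

	if (state >= PSI_NONIDLE)
		return -1;

	return state;
}
```

Under these assumptions a "some" trigger on the IRQ resource is rejected, while a "full" trigger resolves to PSI_IRQ_FULL; the other resources are unaffected because the decrement is only evaluated for PSI_IRQ.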
* [tip: sched/psi] sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
2022-08-25 16:41 ` Chengming Zhou
(?)
(?)
@ 2022-09-09 14:00 ` tip-bot2 for Chengming Zhou
-1 siblings, 0 replies; 38+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2022-09-09 14:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chengming Zhou, Peter Zijlstra (Intel), Johannes Weiner, x86,
linux-kernel
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 52b1364ba0b105122d6de0e719b36db705011ac1
Gitweb: https://git.kernel.org/tip/52b1364ba0b105122d6de0e719b36db705011ac1
Author: Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate: Fri, 26 Aug 2022 00:41:08 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Sep 2022 11:08:32 +02:00
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
PSI already tracks pressure stall information for CPU, memory and
IO. Beyond these, IRQ/SOFTIRQ time can have an obvious impact on
the productivity of some workloads, such as web services.
When CONFIG_IRQ_TIME_ACCOUNTING is enabled, update_rq_clock_task()
gives us the IRQ/SOFTIRQ delta time, which we record against the
cgroups of the CPU's current task as PSI_IRQ_FULL.
Note that there is no PSI_IRQ_SOME: IRQ/SOFTIRQ always runs in the
context of the current task on the CPU, so nothing productive can
run in the meantime even if it were runnable. Hence only
PSI_IRQ_FULL is used.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++-
include/linux/psi_types.h | 10 ++-
kernel/cgroup/cgroup.c | 27 +++++++++-
kernel/sched/core.c | 1 +-
kernel/sched/psi.c | 74 +++++++++++++++++++++++-
kernel/sched/stats.h | 2 +-
6 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77b..971c418 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb749..40c2817 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
- NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ,
+#endif
+ NR_PSI_RESOURCES,
};
/*
@@ -58,9 +61,12 @@ enum psi_states {
PSI_MEM_FULL,
PSI_CPU_SOME,
PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ PSI_IRQ_FULL,
+#endif
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
- NR_PSI_STATES = 7,
+ NR_PSI_STATES,
};
/* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 96aefdb..b46d39b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3763,6 +3763,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+ struct cgroup *cgrp = seq_css(seq)->cgroup;
+ struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+
+ return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
@@ -5179,6 +5196,16 @@ static struct cftype cgroup_base_files[] = {
.poll = cgroup_pressure_poll,
.release = cgroup_pressure_release,
},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ {
+ .name = "irq.pressure",
+ .flags = CFTYPE_PRESSURE,
+ .seq_show = cgroup_irq_pressure_show,
+ .write = cgroup_irq_pressure_write,
+ .poll = cgroup_pressure_poll,
+ .release = cgroup_pressure_release,
+ },
+#endif
#endif /* CONFIG_PSI */
{ } /* terminate */
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253..7d1ea92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+ psi_account_irqtime(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4702a77..2545a78 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -904,6 +904,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+ int cpu = task_cpu(task);
+ void *iter = NULL;
+ struct psi_group *group;
+ struct psi_group_cpu *groupc;
+ u64 now;
+
+ if (!task->pid)
+ return;
+
+ now = cpu_clock(cpu);
+
+ while ((group = iterate_groups(task, &iter))) {
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, now);
+ groupc->times[PSI_IRQ_FULL] += delta;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (group->poll_states & (1 << PSI_IRQ_FULL))
+ psi_schedule_poll_work(group, 1);
+ }
+}
+#endif
+
/**
* psi_memstall_enter - mark the beginning of a memory stall section
* @flags: flags to handle nested sections
@@ -1065,6 +1095,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
+ bool only_full = false;
int full;
u64 now;
@@ -1079,7 +1110,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
group->avg_next_update = update_averages(group, now);
mutex_unlock(&group->avgs_lock);
- for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ only_full = res == PSI_IRQ;
+#endif
+
+ for (full = 0; full < 2 - only_full; full++) {
unsigned long avg[3] = { 0, };
u64 total = 0;
int w;
@@ -1093,7 +1128,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
}
seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
- full ? "full" : "some",
+ full || only_full ? "full" : "some",
LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1121,6 +1156,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
else
return ERR_PTR(-EINVAL);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+ return ERR_PTR(-EINVAL);
+#endif
+
if (state >= PSI_NONIDLE)
return ERR_PTR(-EINVAL);
@@ -1405,6 +1445,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
.proc_release = psi_fop_release,
};
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+ return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+ size_t nbytes, loff_t *ppos)
+{
+ return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+ .proc_open = psi_irq_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = psi_irq_write,
+ .proc_poll = psi_fop_poll,
+ .proc_release = psi_fop_release,
+};
+#endif
+
static int __init psi_proc_init(void)
{
if (psi_enable) {
@@ -1412,6 +1479,9 @@ static int __init psi_proc_init(void)
proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+ proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
}
return 0;
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467..84a1889 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
^ permalink raw reply related [flat|nested] 38+ messages in thread
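With this commit, psi_show() emits a single "full" line for the IRQ resource, in the format given by the seq_printf() call in the diff above. The following is a minimal user-space sketch of parsing one such line. struct psi_line and parse_psi_line() are hypothetical names chosen for illustration; they are not part of the patch.

```c
#include <stdio.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_line {
	char kind[8];               /* "some" or "full" */
	double avg10, avg60, avg300;
	unsigned long long total;   /* cumulative stall time */
};

/* Parse a line shaped like the seq_printf() format in psi_show().
 * Returns 0 on success, -1 if the line does not match. */
static int parse_psi_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line,
		       "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}
```

A caller would typically read the line with fgets() from /proc/pressure/irq (or from a cgroup's irq.pressure file) and pass it to parse_psi_line(); on kernels built without CONFIG_IRQ_TIME_ACCOUNTING the file does not exist, so the open itself has to be checked.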
* [PATCH v4 08/10] sched/psi: consolidate cgroup_psi()
2022-08-25 16:41 ` Chengming Zhou
@ 2022-08-25 16:41 ` Chengming Zhou
-1 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A,
mkoutny-IBi9RG/b67k, surenb-hpIqsD4AKlfQT0dZR+AlfA
Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, corbet-T1hC0tSOHrs,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
songmuchun-EC8Uxl6Npydl57MIdRCFDg, Chengming Zhou
cgroup_psi() can't return the psi_group for the root cgroup, so many callers
open-code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".
Move the cgroup_psi() definition to <linux/psi.h>, where it can return
psi_system for the root cgroup and thus handle all cgroups.
Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
include/linux/cgroup.h | 5 -----
include/linux/psi.h | 6 ++++++
kernel/cgroup/cgroup.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7ed1fa7a6fc8..3c48753f2949 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -682,11 +682,6 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
-static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
-{
- return cgrp->psi;
-}
-
bool cgroup_psi_enabled(void);
static inline void cgroup_init_kthreadd(void)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index fffd229fbf19..362a74ca1d3b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/poll.h>
#include <linux/cgroup-defs.h>
+#include <linux/cgroup.h>
struct seq_file;
struct css_set;
@@ -30,6 +31,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+}
+
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 371131a8b6f8..1d392c91ef95 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3657,21 +3657,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IO);
}
static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_MEM);
}
static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_CPU);
}
@@ -3697,7 +3697,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return -EBUSY;
}
- psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ psi = cgroup_psi(cgrp);
new = psi_trigger_create(psi, buf, res);
if (IS_ERR(new)) {
cgroup_put(cgrp);
@@ -3735,7 +3735,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IRQ);
}
--
2.37.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [tip: sched/psi] sched/psi: Consolidate cgroup_psi()
2022-08-25 16:41 ` Chengming Zhou
(?)
@ 2022-09-09 14:00 ` tip-bot2 for Chengming Zhou
-1 siblings, 0 replies; 38+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2022-09-09 14:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chengming Zhou, Peter Zijlstra (Intel), Johannes Weiner, x86,
linux-kernel
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: 57899a6610e67ba26fa3251ebbef4a5ed21efc5d
Gitweb: https://git.kernel.org/tip/57899a6610e67ba26fa3251ebbef4a5ed21efc5d
Author: Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate: Fri, 26 Aug 2022 00:41:09 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Sep 2022 11:08:33 +02:00
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return the psi_group for the root cgroup, so many callers
open-code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".
Move the cgroup_psi() definition to <linux/psi.h>, where it can return
psi_system for the root cgroup and thus handle all cgroups.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com
---
include/linux/cgroup.h | 5 -----
include/linux/psi.h | 6 ++++++
kernel/cgroup/cgroup.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b0914aa..80cb970 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -673,11 +673,6 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
-static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
-{
- return cgrp->psi;
-}
-
bool cgroup_psi_enabled(void);
static inline void cgroup_init_kthreadd(void)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index fffd229..362a74c 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/poll.h>
#include <linux/cgroup-defs.h>
+#include <linux/cgroup.h>
struct seq_file;
struct css_set;
@@ -30,6 +31,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+}
+
int psi_cgroup_alloc(struct cgroup *cgrp);
void psi_cgroup_free(struct cgroup *cgrp);
void cgroup_move_task(struct task_struct *p, struct css_set *to);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b46d39b..772b35d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3689,21 +3689,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IO);
}
static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_MEM);
}
static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_CPU);
}
@@ -3729,7 +3729,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return -EBUSY;
}
- psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ psi = cgroup_psi(cgrp);
new = psi_trigger_create(psi, buf, res);
if (IS_ERR(new)) {
cgroup_put(cgrp);
@@ -3767,7 +3767,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ struct psi_group *psi = cgroup_psi(cgrp);
return psi_show(seq, psi, PSI_IRQ);
}
^ permalink raw reply related [flat|nested] 38+ messages in thread
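The effect of the consolidated helper can be tried outside the kernel with stand-in types. This is a minimal sketch only: struct cgroup and struct psi_group below are simplified mock-ups, and the ino field is an assumed stand-in for what cgroup_ino(cgrp) returns in the kernel.

```c
/* Simplified stand-ins for the kernel types; only what the helper needs. */
struct psi_group {
	int unused;
};

struct cgroup {
	unsigned long ino;        /* stand-in for cgroup_ino(cgrp) */
	struct psi_group *psi;    /* per-cgroup group; not used for the root */
};

/* System-wide group, which the root cgroup maps to. */
static struct psi_group psi_system;

/* Same decision as the helper moved to <linux/psi.h>: the root cgroup
 * (inode number 1) resolves to psi_system, every other cgroup to its own
 * psi_group. */
static struct psi_group *cgroup_psi(struct cgroup *cgrp)
{
	return cgrp->ino == 1 ? &psi_system : cgrp->psi;
}
```

With the special case inside the helper, callers such as the *_pressure_show() functions no longer need to know that the root cgroup is handled differently.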
* [PATCH v4 09/10] sched/psi: cache parent psi_group to speed up groups iterate
2022-08-25 16:41 ` Chengming Zhou
@ 2022-08-25 16:41 ` Chengming Zhou
-1 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-08-25 16:41 UTC (permalink / raw)
To: hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A,
mkoutny-IBi9RG/b67k, surenb-hpIqsD4AKlfQT0dZR+AlfA
Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r, corbet-T1hC0tSOHrs,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
songmuchun-EC8Uxl6Npydl57MIdRCFDg, Chengming Zhou
We use iterate_groups() to walk the psi_group at each level and update
PSI stats, which is a very hot path.
In the current code, iterate_groups() has to go through multiple branches
and cgroup_parent() to find the parent psi_group at each level, which is
not very efficient.
Cache the parent psi_group in struct psi_group, so that we only need to
look up the task's own psi_group first and then follow group->parent.
Signed-off-by: Chengming Zhou <zhouchengming-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
include/linux/psi_types.h | 2 ++
kernel/sched/psi.c | 49 +++++++++++++++------------------------
2 files changed, 21 insertions(+), 30 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 40c28171cd91..a0b746258c68 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,8 @@ struct psi_trigger {
};
struct psi_group {
+ struct psi_group *parent;
+
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2545a78f82d8..9a8aee80a087 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -772,27 +772,12 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
{
- if (*iter == &psi_system)
- return NULL;
-
#ifdef CONFIG_CGROUPS
- if (static_branch_likely(&psi_cgroups_enabled)) {
- struct cgroup *cgroup = NULL;
-
- if (!*iter)
- cgroup = task->cgroups->dfl_cgrp;
- else
- cgroup = cgroup_parent(*iter);
-
- if (cgroup && cgroup_parent(cgroup)) {
- *iter = cgroup;
- return cgroup_psi(cgroup);
- }
- }
+ if (static_branch_likely(&psi_cgroups_enabled))
+ return cgroup_psi(task_dfl_cgroup(task));
#endif
- *iter = &psi_system;
return &psi_system;
}
@@ -815,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- void *iter = NULL;
u64 now;
if (!task->pid)
@@ -825,8 +809,10 @@ void psi_task_change(struct task_struct *task, int clear, int set)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter)))
+ group = task_psi_group(task);
+ do {
psi_group_change(group, cpu, clear, set, now, true);
+ } while ((group = group->parent));
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -834,7 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
{
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
- void *iter;
u64 now = cpu_clock(cpu);
if (next->pid) {
@@ -844,8 +829,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* ancestors with @prev, those will already have @prev's
* TSK_ONCPU bit set, and we can stop the iteration there.
*/
- iter = NULL;
- while ((group = iterate_groups(next, &iter))) {
+ group = task_psi_group(next);
+ do {
if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
PSI_ONCPU) {
common = group;
@@ -853,7 +838,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
- }
+ } while ((group = group->parent));
}
if (prev->pid) {
@@ -886,9 +871,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_flags_change(prev, clear, set);
- iter = NULL;
- while ((group = iterate_groups(prev, &iter)) && group != common)
+ group = task_psi_group(prev);
+ do {
+ if (group == common)
+ break;
psi_group_change(group, cpu, clear, set, now, wake_clock);
+ } while ((group = group->parent));
/*
* TSK_ONCPU is handled up to the common ancestor. If there are
@@ -898,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
- for (; group; group = iterate_groups(prev, &iter))
+ for (; group; group = group->parent)
psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
@@ -908,7 +896,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void psi_account_irqtime(struct task_struct *task, u32 delta)
{
int cpu = task_cpu(task);
- void *iter = NULL;
struct psi_group *group;
struct psi_group_cpu *groupc;
u64 now;
@@ -918,7 +905,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter))) {
+ group = task_psi_group(task);
+ do {
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -930,7 +918,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
if (group->poll_states & (1 << PSI_IRQ_FULL))
psi_schedule_poll_work(group, 1);
- }
+ } while ((group = group->parent));
}
#endif
@@ -1010,6 +998,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
return -ENOMEM;
}
group_init(cgroup->psi);
+ cgroup->psi->parent = cgroup_psi(cgroup_parent(cgroup));
return 0;
}
--
2.37.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [tip: sched/psi] sched/psi: Cache parent psi_group to speed up group iteration
2022-08-25 16:41 ` Chengming Zhou
(?)
@ 2022-09-09 14:00 ` tip-bot2 for Chengming Zhou
-1 siblings, 0 replies; 38+ messages in thread
From: tip-bot2 for Chengming Zhou @ 2022-09-09 14:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chengming Zhou, Peter Zijlstra (Intel), Johannes Weiner, x86,
linux-kernel
The following commit has been merged into the sched/psi branch of tip:
Commit-ID: dc86aba751e2867244411adda1562f6664747019
Gitweb: https://git.kernel.org/tip/dc86aba751e2867244411adda1562f6664747019
Author: Chengming Zhou <zhouchengming@bytedance.com>
AuthorDate: Fri, 26 Aug 2022 00:41:10 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 09 Sep 2022 11:08:33 +02:00
sched/psi: Cache parent psi_group to speed up group iteration
We use iterate_groups() to iterate over each level of psi_group to update
PSI stats, which is a very hot path.
In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to get the parent psi_group for each level, which is not
very efficient.
This patch caches the parent psi_group in struct psi_group, so we only need
to get the psi_group of the task itself first, then just use group->parent
to iterate.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com
---
include/linux/psi_types.h | 2 ++-
kernel/sched/psi.c | 49 ++++++++++++++------------------------
2 files changed, 21 insertions(+), 30 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 40c2817..a0b7462 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,8 @@ struct psi_trigger {
};
struct psi_group {
+ struct psi_group *parent;
+
/* Protects data used by the aggregator */
struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2545a78..9a8aee8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -772,27 +772,12 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->avgs_work, PSI_FREQ);
}
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
{
- if (*iter == &psi_system)
- return NULL;
-
#ifdef CONFIG_CGROUPS
- if (static_branch_likely(&psi_cgroups_enabled)) {
- struct cgroup *cgroup = NULL;
-
- if (!*iter)
- cgroup = task->cgroups->dfl_cgrp;
- else
- cgroup = cgroup_parent(*iter);
-
- if (cgroup && cgroup_parent(cgroup)) {
- *iter = cgroup;
- return cgroup_psi(cgroup);
- }
- }
+ if (static_branch_likely(&psi_cgroups_enabled))
+ return cgroup_psi(task_dfl_cgroup(task));
#endif
- *iter = &psi_system;
return &psi_system;
}
@@ -815,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
struct psi_group *group;
- void *iter = NULL;
u64 now;
if (!task->pid)
@@ -825,8 +809,10 @@ void psi_task_change(struct task_struct *task, int clear, int set)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter)))
+ group = task_psi_group(task);
+ do {
psi_group_change(group, cpu, clear, set, now, true);
+ } while ((group = group->parent));
}
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -834,7 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
{
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
- void *iter;
u64 now = cpu_clock(cpu);
if (next->pid) {
@@ -844,8 +829,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
* ancestors with @prev, those will already have @prev's
* TSK_ONCPU bit set, and we can stop the iteration there.
*/
- iter = NULL;
- while ((group = iterate_groups(next, &iter))) {
+ group = task_psi_group(next);
+ do {
if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
PSI_ONCPU) {
common = group;
@@ -853,7 +838,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
- }
+ } while ((group = group->parent));
}
if (prev->pid) {
@@ -886,9 +871,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_flags_change(prev, clear, set);
- iter = NULL;
- while ((group = iterate_groups(prev, &iter)) && group != common)
+ group = task_psi_group(prev);
+ do {
+ if (group == common)
+ break;
psi_group_change(group, cpu, clear, set, now, wake_clock);
+ } while ((group = group->parent));
/*
* TSK_ONCPU is handled up to the common ancestor. If there are
@@ -898,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
- for (; group; group = iterate_groups(prev, &iter))
+ for (; group; group = group->parent)
psi_group_change(group, cpu, clear, set, now, wake_clock);
}
}
@@ -908,7 +896,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
void psi_account_irqtime(struct task_struct *task, u32 delta)
{
int cpu = task_cpu(task);
- void *iter = NULL;
struct psi_group *group;
struct psi_group_cpu *groupc;
u64 now;
@@ -918,7 +905,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
now = cpu_clock(cpu);
- while ((group = iterate_groups(task, &iter))) {
+ group = task_psi_group(task);
+ do {
groupc = per_cpu_ptr(group->pcpu, cpu);
write_seqcount_begin(&groupc->seq);
@@ -930,7 +918,7 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
if (group->poll_states & (1 << PSI_IRQ_FULL))
psi_schedule_poll_work(group, 1);
- }
+ } while ((group = group->parent));
}
#endif
@@ -1010,6 +998,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
return -ENOMEM;
}
group_init(cgroup->psi);
+ cgroup->psi->parent = cgroup_psi(cgroup_parent(cgroup));
return 0;
}
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v4 00/10] sched/psi: some optimizations and extensions
@ 2022-09-06 13:13 ` Chengming Zhou
0 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-09-06 13:13 UTC (permalink / raw)
To: hannes, tj
Cc: surenb, mkoutny, mingo, peterz, gregkh, corbet, cgroups,
linux-doc, linux-kernel, songmuchun
Hello,
Could this series be merged into linux-next?
Thanks.
On 2022/8/26 00:41, Chengming Zhou wrote:
> Hi all,
>
> This patch series are some optimizations and extensions for PSI.
>
> patch 1/10 fix periodic aggregation shut off problem introduced by earlier
> commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups").
>
> patch 2-4 are some misc optimizations, so put them in front of this series.
>
> patch 5/10 optimize task switch inside shared cgroups when in_memstall status
> of prev task and next task are different.
>
> patch 6/10 remove NR_ONCPU task accounting to save 4 bytes in the first
> cacheline to be used by the following patch 7/10, which introduce new
> PSI resource PSI_IRQ to track IRQ/SOFTIRQ pressure stall information.
>
> patch 8-9 cache parent psi_group in struct psi_group to speed up the
> hot iteration path.
>
> patch 10/10 introduce a per-cgroup interface "cgroup.pressure" to disable
> or re-enable PSI in the cgroup level, and we implement hiding and unhiding
> the pressure files per Tejun's suggestion[1], which depends on his work[2].
>
> [1] https://lore.kernel.org/all/YvqjhqJQi2J8RG3X@slm.duckdns.org/
> [2] https://lore.kernel.org/all/20220820000550.367085-1-tj@kernel.org/
>
> Performance test using mmtests/config-scheduler-perfpipe in
> /user.slice/user-0.slice/session-4.scope:
>
>                             next                patched      patched/only-leaf
> Min       Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
> 1st-qrtle Time        8.90 (   0.00%)        8.58 (   3.63%)        8.05 (   9.58%)
> 2nd-qrtle Time        8.94 (   0.00%)        8.61 (   3.65%)        8.09 (   9.50%)
> 3rd-qrtle Time        8.99 (   0.00%)        8.65 (   3.75%)        8.15 (   9.35%)
> Max-1     Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
> Max-5     Time        8.82 (   0.00%)        8.49 (   3.74%)        8.00 (   9.32%)
> Max-10    Time        8.84 (   0.00%)        8.55 (   3.20%)        8.04 (   9.05%)
> Max-90    Time        9.04 (   0.00%)        8.67 (   4.10%)        8.18 (   9.51%)
> Max-95    Time        9.04 (   0.00%)        8.68 (   4.03%)        8.20 (   9.26%)
> Max-99    Time        9.07 (   0.00%)        8.73 (   3.82%)        8.25 (   9.11%)
> Max       Time        9.12 (   0.00%)        8.89 (   2.54%)        8.27 (   9.29%)
> Amean     Time        8.95 (   0.00%)        8.62 *   3.67%*        8.11 *   9.43%*
>
> Big thanks to Johannes Weiner, Tejun Heo and Michal Koutný for your
> suggestions and review!
>
>
> Changes in v4:
> - Collect Acked-by tags from Johannes Weiner.
> - Add many clear comments and changelogs per Johannes Weiner.
> - Replace for_each_psi_group() with better open-code.
> - Change to use better names cgroup_pressure_show() and
> cgroup_pressure_write().
> - Change to use better name psi_cgroup_restart() and only
> call it on enabling.
>
> Changes in v3:
> - Rebase on linux-next and reorder patches to put misc optimizations
> patches in the front of this series.
> - Drop patch "sched/psi: don't change task psi_flags when migrate CPU/group"
> since it caused a little performance regression and it's just
> code refactoring, so drop it.
> - Don't define PSI_IRQ and PSI_IRQ_FULL when !CONFIG_IRQ_TIME_ACCOUNTING,
> in which case they are not used.
> - Add patch 8/10 "sched/psi: consolidate cgroup_psi()" make cgroup_psi()
> can handle all cgroups including root cgroup, make patch 9/10 simpler.
> - Rename interface to "cgroup.pressure" and add some explanation
> per Michal's suggestion.
> - Hide and unhide pressure files when disable/re-enable cgroup PSI,
> depends on Tejun's work.
>
> Changes in v2:
> - Add Acked-by tags from Johannes Weiner. Thanks for review!
> - Fix periodic aggregation wakeup for common ancestors in
> psi_task_switch().
> - Add patch 7/10 from Johannes Weiner, which remove NR_ONCPU
> task accounting to save 4 bytes in the first cacheline.
> - Remove "psi_irq=" kernel cmdline parameter in last version.
> - Add per-cgroup interface "cgroup.psi" to disable/re-enable
> PSI stats accounting in the cgroup level.
>
>
> Chengming Zhou (9):
> sched/psi: fix periodic aggregation shut off
> sched/psi: don't create cgroup PSI files when psi_disabled
> sched/psi: save percpu memory when !psi_cgroups_enabled
> sched/psi: move private helpers to sched/stats.h
> sched/psi: optimize task switch inside shared cgroups again
> sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
> sched/psi: consolidate cgroup_psi()
> sched/psi: cache parent psi_group to speed up groups iterate
> sched/psi: per-cgroup PSI accounting disable/re-enable interface
>
> Johannes Weiner (1):
> sched/psi: remove NR_ONCPU task accounting
>
> Documentation/admin-guide/cgroup-v2.rst | 23 ++
> include/linux/cgroup-defs.h | 3 +
> include/linux/cgroup.h | 5 -
> include/linux/psi.h | 12 +-
> include/linux/psi_types.h | 29 ++-
> kernel/cgroup/cgroup.c | 106 ++++++++-
> kernel/sched/core.c | 1 +
> kernel/sched/psi.c | 280 +++++++++++++++++-------
> kernel/sched/stats.h | 6 +
> 9 files changed, 362 insertions(+), 103 deletions(-)
>
^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <be071d5a-ff2d-d06e-2f89-f2ca247dd19e-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>]
* Re: [PATCH v4 00/10] sched/psi: some optimizations and extensions
@ 2022-09-06 14:43 ` Peter Zijlstra
0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-09-06 14:43 UTC (permalink / raw)
To: Chengming Zhou
Cc: hannes, tj, surenb, mkoutny, mingo, gregkh, corbet, cgroups,
linux-doc, linux-kernel, songmuchun
On Tue, Sep 06, 2022 at 09:13:27PM +0800, Chengming Zhou wrote:
Ah, I see Johannes has acked them all, I missed that.
> > Chengming Zhou (9):
> > sched/psi: fix periodic aggregation shut off
> > sched/psi: don't create cgroup PSI files when psi_disabled
> > sched/psi: save percpu memory when !psi_cgroups_enabled
> > sched/psi: move private helpers to sched/stats.h
> > sched/psi: optimize task switch inside shared cgroups again
> > sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
> > sched/psi: consolidate cgroup_psi()
> > sched/psi: cache parent psi_group to speed up groups iterate
> > sched/psi: per-cgroup PSI accounting disable/re-enable interface
> >
> > Johannes Weiner (1):
> > sched/psi: remove NR_ONCPU task accounting
For future reference:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
Note all patches violate 1.2.2 for not starting the patch description
with an uppercase letter. I'll go manually fix them up this time.
^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <YxdcfX4Ss/9k8qA9-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCH v4 00/10] sched/psi: some optimizations and extensions
@ 2022-09-07 1:55 ` Chengming Zhou
0 siblings, 0 replies; 38+ messages in thread
From: Chengming Zhou @ 2022-09-07 1:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: hannes, tj, surenb, mkoutny, mingo, gregkh, corbet, cgroups,
linux-doc, linux-kernel, songmuchun
On 2022/9/6 22:43, Peter Zijlstra wrote:
> On Tue, Sep 06, 2022 at 09:13:27PM +0800, Chengming Zhou wrote:
>
> Ah, I see Johannes has acked them all, I missed that.
>
>>> Chengming Zhou (9):
>>> sched/psi: fix periodic aggregation shut off
>>> sched/psi: don't create cgroup PSI files when psi_disabled
>>> sched/psi: save percpu memory when !psi_cgroups_enabled
>>> sched/psi: move private helpers to sched/stats.h
>>> sched/psi: optimize task switch inside shared cgroups again
>>> sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
>>> sched/psi: consolidate cgroup_psi()
>>> sched/psi: cache parent psi_group to speed up groups iterate
>>> sched/psi: per-cgroup PSI accounting disable/re-enable interface
>>>
>>> Johannes Weiner (1):
>>> sched/psi: remove NR_ONCPU task accounting
>
> For future reference:
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> Note all patches violate 1.2.2 for not starting the patch description
> with an uppercase letter. I'll go manually fix them up this time.
Sorry about that, thanks for the reference and your manual fix up!
^ permalink raw reply [flat|nested] 38+ messages in thread