* [PATCH 0/6] psi: slightly improve performance of psi
@ 2026-05-12 6:19 Luka Bai
2026-05-12 6:19 ` [PATCH 1/6] psi: move curr_in_memstall out of psi_group_change Luka Bai
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:19 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
PSI is useful for resource pressure monitoring, but its callbacks are
spread across the common scheduling paths, some of which are quite
performance critical. The hottest callback, psi_group_change(), is
called by both psi_task_switch() and psi_task_change(), which run as
part of task switch, enqueue, and dequeue. The CPU usage of PSI is
therefore quite important.
We ran a common hackbench test using the following command:
perf record --kernel-callchains -a -g hackbench -s 512 -P -g 10 -f 30 \
-l 1000 --pipe
On a machine with 8 cores and 16GB of memory split across two NUMA
nodes (8GB each), the flame graph of the perf data showed a PSI CPU
usage of 4.3%, enough to observably affect actual workloads.
This patchset improves the performance of the hot path, which slightly
reduces the cost of PSI. With the same 8-core + 16GB setup, PSI's CPU
usage drops to 3.4%, roughly a 20% improvement. Future patches may push
this further (for example, switches for the different types of PSI
resources).
Patch Details:
========
* Patch 1 moves the read of cpu_curr(cpu)->in_memstall out of
  psi_group_change() to eliminate repeated memory accesses.
* Patch 2 adds a need_psi bit to indicate whether the task needs PSI
  accounting, and moves it, together with psi_flags (which currently
  only uses 5 bits), next to the in_memstall bitfield so that they
  end up on the same cacheline.
* Patch 3 prefetches the parent group's per-cpu data before it is
  actually accessed, since the parent will always be visited in the
  following step.
* Patch 4 calls record_times() only when the state actually changes,
  saving unnecessary accesses.
* Patch 5 attaches a psi_group to the root cgroup, removing an
  unnecessary branch.
* Patch 6 replaces the psi_bug variable with printk_deferred_once()
  and moves the tasks[NR_RUNNING] check, the most likely condition,
  first.
Thanks for reading. Comments and suggestions are very welcome!
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
Luka Bai (6):
psi: move curr_in_memstall out of psi_group_change
psi: reorganize the psi members for cacheline benefits
psi: use prefetch to preread the parent groupc
psi: do not call record_times when the state is not changed
psi: add psi group for the root cgroup
psi: remove psi_bug and move checking of NR_RUNNING ahead
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 20 +------------
include/linux/sched.h | 29 ++++++++++++++++---
kernel/cgroup/cgroup.c | 3 ++
kernel/fork.c | 10 +++++++
kernel/sched/psi.c | 71 ++++++++++++++++++++++++++++++-----------------
6 files changed, 85 insertions(+), 50 deletions(-)
---
base-commit: 972c53e0ec3abfc6f5fe2cb503640710fb23cf95
change-id: 20260512-psi_impr-f543a199f39d
Best regards,
--
Luka Bai <lukabai@tencent.com>
* [PATCH 1/6] psi: move curr_in_memstall out of psi_group_change
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
@ 2026-05-12 6:19 ` Luka Bai
2026-05-12 6:19 ` [PATCH 2/6] psi: reorganize the psi members for cacheline benefits Luka Bai
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:19 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
curr_in_memstall is currently derived by reading in_memstall from
cpu_curr(cpu), which takes multiple memory accesses. The read lives
in psi_group_change(), which is called once per ancestor cgroup, so
it is sometimes redundant: the value cannot change across those
ancestor iterations.
Move the read out to the callers for two reasons:
1. We read the value once instead of once per ancestor cgroup,
avoiding possible unnecessary cacheline stalls.
2. In psi_task_switch() we do not need cpu_curr(cpu) to find the
task currently running on the cpu runqueue: in that context, "next"
is by definition the running task, so the costly call can be
skipped.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
kernel/sched/psi.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..27097cb0dc79 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -795,7 +795,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
static void psi_group_change(struct psi_group *group, int cpu,
unsigned int clear, unsigned int set,
- u64 now, bool wake_clock)
+ u64 now, bool wake_clock, bool curr_in_memstall)
{
struct psi_group_cpu *groupc;
unsigned int t, m;
@@ -868,7 +868,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
* task in a cgroup is in_memstall, the corresponding groupc
* on that cpu is in PSI_MEM_FULL state.
*/
- if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
+ if (unlikely((state_mask & PSI_ONCPU) && curr_in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
record_times(groupc, now);
@@ -910,6 +910,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
u64 now;
+ bool curr_in_memstall;
if (!task->pid)
return;
@@ -917,9 +918,11 @@ void psi_task_change(struct task_struct *task, int clear, int set)
psi_flags_change(task, clear, set);
psi_write_begin(cpu);
+ curr_in_memstall = cpu_curr(cpu)->in_memstall;
now = cpu_clock(cpu);
for_each_group(group, task_psi_group(task))
- psi_group_change(group, cpu, clear, set, now, true);
+ psi_group_change(group, cpu, clear, set, now, true,
+ curr_in_memstall);
psi_write_end(cpu);
}
@@ -929,11 +932,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
struct psi_group *common = NULL;
int cpu = task_cpu(prev);
u64 now;
+ bool curr_in_memstall = false;
psi_write_begin(cpu);
now = cpu_clock(cpu);
if (next->pid) {
+ curr_in_memstall = next->in_memstall;
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -947,7 +952,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
common = group;
break;
}
- psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+ psi_group_change(group, cpu, 0, TSK_ONCPU, now, true,
+ curr_in_memstall);
}
}
@@ -984,7 +990,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
for_each_group(group, task_psi_group(prev)) {
if (group == common)
break;
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, wake_clock,
+ curr_in_memstall);
}
/*
@@ -996,7 +1003,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
clear &= ~TSK_ONCPU;
for_each_group(group, common)
- psi_group_change(group, cpu, clear, set, now, wake_clock);
+ psi_group_change(group, cpu, clear, set, now, wake_clock,
+ curr_in_memstall);
}
}
psi_write_end(cpu);
@@ -1236,7 +1244,8 @@ void psi_cgroup_restart(struct psi_group *group)
psi_write_begin(cpu);
now = cpu_clock(cpu);
- psi_group_change(group, cpu, 0, 0, now, true);
+ psi_group_change(group, cpu, 0, 0, now, true,
+ cpu_curr(cpu)->in_memstall);
psi_write_end(cpu);
}
}
--
2.52.0
* [PATCH 2/6] psi: reorganize the psi members for cacheline benefits
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
2026-05-12 6:19 ` [PATCH 1/6] psi: move curr_in_memstall out of psi_group_change Luka Bai
@ 2026-05-12 6:19 ` Luka Bai
2026-05-12 6:19 ` [PATCH 3/6] psi: use prefetch to preread the parent groupc Luka Bai
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:19 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
Currently, whether a task needs PSI accounting is checked by reading
task->pid, which does not sit on the same cacheline as the other PSI
fields such as in_memstall. perf record indicates this causes some
cacheline stalls, so we would like to group these fields together.
However, directly swapping the order of pid and restart_block could
cause cacheline problems in other scenarios that are hard to identify
clearly. Instead, add a need_psi bitfield that carries the same
information and place it next to in_memstall. need_psi never changes
after the task is created, so there are no synchronization concerns.
Adding one bit to the existing unsigned int bitfield does not enlarge
task_struct or change its memory layout at all.
Also move psi_flags, which only uses 5 bits, next to in_memstall and
need_psi so that all of them benefit from the same cacheline. The 5
extra bits still fit in the same unsigned int, so this does not grow
task_struct either; on the contrary, it shrinks task_struct, since
psi_flags was previously a standalone unsigned int.
We also add NR_TSK_ONCPU and NR_PSI_ALL_COUNTS to the psi_task_count
enum to make the semantics clearer, and move the definition from
linux/psi_types.h into linux/sched.h, where those enums are now
needed. Neither of these two revisions makes any functional
difference to the code.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
include/linux/psi_types.h | 20 +-------------------
include/linux/sched.h | 29 +++++++++++++++++++++++++----
kernel/fork.c | 10 ++++++++++
kernel/sched/psi.c | 6 +++---
4 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index dd10c22299ab..5639dcdd90af 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -10,24 +10,6 @@
#ifdef CONFIG_PSI
-/* Tracked task states */
-enum psi_task_count {
- NR_IOWAIT,
- NR_MEMSTALL,
- NR_RUNNING,
- /*
- * For IO and CPU stalls the presence of running/oncpu tasks
- * in the domain means a partial rather than a full stall.
- * For memory it's not so simple because of page reclaimers:
- * they are running/oncpu while representing a stall. To tell
- * whether a domain has productivity left or not, we need to
- * distinguish between regular running (i.e. productive)
- * threads and memstall ones.
- */
- NR_MEMSTALL_RUNNING,
- NR_PSI_TASK_COUNTS = 4,
-};
-
/* Task state bitmasks */
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
@@ -35,7 +17,7 @@ enum psi_task_count {
#define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING)
/* Only one task can be scheduled, no corresponding task count */
-#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
+#define TSK_ONCPU (1 << NR_TSK_ONCPU)
/* Resources that workloads could be stalled on */
enum psi_res {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..34d7f80531e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,6 +817,28 @@ struct kmap_ctrl {
#endif
};
+#ifdef CONFIG_PSI
+/* Tracked task states */
+enum psi_task_count {
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_RUNNING,
+ /*
+ * For IO and CPU stalls the presence of running/oncpu tasks
+ * in the domain means a partial rather than a full stall.
+ * For memory it's not so simple because of page reclaimers:
+ * they are running/oncpu while representing a stall. To tell
+ * whether a domain has productivity left or not, we need to
+ * distinguish between regular running (i.e. productive)
+ * threads and memstall ones.
+ */
+ NR_MEMSTALL_RUNNING,
+ NR_PSI_TASK_COUNTS,
+ NR_TSK_ONCPU = NR_PSI_TASK_COUNTS,
+ NR_PSI_ALL_COUNTS,
+};
+#endif
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1030,6 +1052,9 @@ struct task_struct {
#ifdef CONFIG_PSI
/* Stalled due to lack of memory */
unsigned in_memstall:1;
+ unsigned need_psi:1;
+ /* Pressure stall state */
+ unsigned psi_flags:NR_PSI_ALL_COUNTS;
#endif
#ifdef CONFIG_PAGE_OWNER
/* Used by page_owner=on to detect recursion in page tracking. */
@@ -1299,10 +1324,6 @@ struct task_struct {
kernel_siginfo_t *last_siginfo;
struct task_io_accounting ioac;
-#ifdef CONFIG_PSI
- /* Pressure stall state */
- unsigned int psi_flags;
-#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d97fd71d7f6..20b47c876b27 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2177,6 +2177,16 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_PSI
p->psi_flags = 0;
+ /*
+ * Only setup need_psi to 1 for non-idle tasks. We
+ * also need to reset need_psi of idle tasks to 0 since
+ * their values are copied from the init task whose
+ * need_psi is not 0.
+ */
+ if (pid != &init_struct_pid)
+ p->need_psi = 1;
+ else
+ p->need_psi = 0;
#endif
task_io_accounting_init(&p->ioac);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 27097cb0dc79..7374c05a5751 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -912,7 +912,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
u64 now;
bool curr_in_memstall;
- if (!task->pid)
+ if (!task->need_psi)
return;
psi_flags_change(task, clear, set);
@@ -937,7 +937,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_write_begin(cpu);
now = cpu_clock(cpu);
- if (next->pid) {
+ if (next->need_psi) {
curr_in_memstall = next->in_memstall;
psi_flags_change(next, 0, TSK_ONCPU);
/*
@@ -957,7 +957,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
- if (prev->pid) {
+ if (prev->need_psi) {
int clear = TSK_ONCPU, set = 0;
bool wake_clock = true;
--
2.52.0
* [PATCH 3/6] psi: use prefetch to preread the parent groupc
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
2026-05-12 6:19 ` [PATCH 1/6] psi: move curr_in_memstall out of psi_group_change Luka Bai
2026-05-12 6:19 ` [PATCH 2/6] psi: reorganize the psi members for cacheline benefits Luka Bai
@ 2026-05-12 6:19 ` Luka Bai
2026-05-12 6:20 ` [PATCH 4/6] psi: do not call record_times when the state is not changed Luka Bai
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:19 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
In psi_group_change() we always iterate the cgroups from the child
all the way up to the root. The groups are chained through parent
pointers, so the hardware prefetcher cannot anticipate the next node
of the walk. Add a software prefetch for the parent's groupc before
it is actually needed; this shows a measurable benefit in the final
result.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
kernel/sched/psi.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7374c05a5751..9b7a85d1bc28 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -793,6 +793,15 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
#define for_each_group(iter, group) \
for (typeof(group) iter = group; iter; iter = iter->parent)
+static inline struct psi_group_cpu *prefetch_and_get_groupc(struct psi_group *group, int cpu)
+{
+ struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ if (group->parent)
+ prefetchw(per_cpu_ptr(group->parent->pcpu, cpu));
+ return groupc;
+}
+
static void psi_group_change(struct psi_group *group, int cpu,
unsigned int clear, unsigned int set,
u64 now, bool wake_clock, bool curr_in_memstall)
@@ -802,7 +811,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
u32 state_mask;
lockdep_assert_rq_held(cpu_rq(cpu));
- groupc = per_cpu_ptr(group->pcpu, cpu);
+ groupc = prefetch_and_get_groupc(group, cpu);
/*
* Start with TSK_ONCPU, which doesn't have a corresponding
--
2.52.0
* [PATCH 4/6] psi: do not call record_times when the state is not changed
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
` (2 preceding siblings ...)
2026-05-12 6:19 ` [PATCH 3/6] psi: use prefetch to preread the parent groupc Luka Bai
@ 2026-05-12 6:20 ` Luka Bai
2026-05-12 6:20 ` [PATCH 5/6] psi: add psi group for the root cgroup Luka Bai
2026-05-12 6:20 ` [PATCH 6/6] psi: remove psi_bug and move checking of NR_RUNNING ahead Luka Bai
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
In psi_group_change(), record_times() is called unconditionally,
whether or not state_mask has changed. Since the call has a cost,
skip it when the state is unchanged and leave the psi times as they
are.
This makes no difference to the final result: whenever the psi times
are read, get_recent_times() always folds the still-elapsing time of
the current state into the result.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
kernel/sched/psi.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9b7a85d1bc28..4c4bd134c785 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -880,9 +880,15 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (unlikely((state_mask & PSI_ONCPU) && curr_in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
- record_times(groupc, now);
-
- groupc->state_mask = state_mask;
+ /*
+ * We only need to record times when the state changes. Or
+ * we can keep it unchanged and wait for get_recent_times()
+ * to handle the remaining time.
+ */
+ if (state_mask != groupc->state_mask) {
+ record_times(groupc, now);
+ groupc->state_mask = state_mask;
+ }
if (state_mask & group->rtpoll_states)
psi_schedule_rtpoll_work(group, 1, false);
--
2.52.0
* [PATCH 5/6] psi: add psi group for the root cgroup
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
` (3 preceding siblings ...)
2026-05-12 6:20 ` [PATCH 4/6] psi: do not call record_times when the state is not changed Luka Bai
@ 2026-05-12 6:20 ` Luka Bai
2026-05-12 6:20 ` [PATCH 6/6] psi: remove psi_bug and move checking of NR_RUNNING ahead Luka Bai
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
cgroup_psi() currently carries a branch that checks whether the
cgroup is the root and, if so, returns psi_system instead of
cgrp->psi. This is only because the default hierarchy's root has no
psi group attached. Attach psi_system as the root cgroup's psi group
at initialization and remove the branch from cgroup_psi().
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
include/linux/psi.h | 2 +-
kernel/cgroup/cgroup.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..8f2db511d051 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -34,7 +34,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
#ifdef CONFIG_CGROUPS
static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
{
- return cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+ return cgrp->psi;
}
int psi_cgroup_alloc(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 43adc96c7f1a..357c68662d18 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -178,6 +178,9 @@ static DEFINE_PER_CPU(struct cgroup_rstat_base_cpu, root_rstat_base_cpu);
/* the default hierarchy */
struct cgroup_root cgrp_dfl_root = {
.cgrp.self.rstat_cpu = &root_rstat_cpu,
+#ifdef CONFIG_PSI
+ .cgrp.psi = &psi_system,
+#endif
.cgrp.rstat_base_cpu = &root_rstat_base_cpu,
};
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
--
2.52.0
* [PATCH 6/6] psi: remove psi_bug and move checking of NR_RUNNING ahead
2026-05-12 6:19 [PATCH 0/6] psi: slightly improve performance of psi Luka Bai
` (4 preceding siblings ...)
2026-05-12 6:20 ` [PATCH 5/6] psi: add psi group for the root cgroup Luka Bai
@ 2026-05-12 6:20 ` Luka Bai
5 siblings, 0 replies; 7+ messages in thread
From: Luka Bai @ 2026-05-12 6:20 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Suren Baghdasaryan, Peter Zijlstra, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Kees Cook, Tejun Heo, Michal Koutný, linux-kernel, cgroups,
Luka Bai
From: Luka Bai <lukabai@tencent.com>
PSI state accounting does some bug detection to keep the code
maintainable, and the psi_bug variable makes those diagnostics print
only once. Replace psi_bug with printk_deferred_once(), which has the
same effect and improves readability.
Also annotate these bug-detection branches with likely()/unlikely(),
and check tasks[NR_RUNNING], the most likely condition, first in
test_states().
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
kernel/sched/psi.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4c4bd134c785..70dd642af5e0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -141,8 +141,6 @@
#include <linux/psi.h>
#include "sched.h"
-static int psi_bug __read_mostly;
-
DEFINE_STATIC_KEY_FALSE(psi_disabled);
static DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled);
@@ -262,7 +260,7 @@ static u32 test_states(unsigned int *tasks, u32 state_mask)
if (tasks[NR_RUNNING] && !oncpu)
state_mask |= BIT(PSI_CPU_FULL);
- if (tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || tasks[NR_RUNNING])
+ if (tasks[NR_RUNNING] || tasks[NR_MEMSTALL] || tasks[NR_IOWAIT])
state_mask |= BIT(PSI_NONIDLE);
return state_mask;
@@ -836,14 +834,13 @@ static void psi_group_change(struct psi_group *group, int cpu,
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
continue;
- if (groupc->tasks[t]) {
+ if (likely(groupc->tasks[t])) {
groupc->tasks[t]--;
- } else if (!psi_bug) {
- printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
+ } else {
+ printk_deferred_once("psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
groupc->tasks[3], clear, set);
- psi_bug = 1;
}
}
@@ -908,13 +905,11 @@ static inline struct psi_group *task_psi_group(struct task_struct *task)
static void psi_flags_change(struct task_struct *task, int clear, int set)
{
- if (((task->psi_flags & set) ||
- (task->psi_flags & clear) != clear) &&
- !psi_bug) {
- printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+ if (unlikely(((task->psi_flags & set) ||
+ (task->psi_flags & clear) != clear))) {
+ printk_deferred_once("psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
task->pid, task->comm, task_cpu(task),
task->psi_flags, clear, set);
- psi_bug = 1;
}
task->psi_flags &= ~clear;
@@ -927,7 +922,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
u64 now;
bool curr_in_memstall;
- if (!task->need_psi)
+ if (unlikely(!task->need_psi))
return;
psi_flags_change(task, clear, set);
--
2.52.0