From: Luka Bai <lukafocus@icloud.com>
To: linux-mm@kvack.org
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
"Suren Baghdasaryan" <surenb@google.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Ingo Molnar" <mingo@redhat.com>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
"Valentin Schneider" <vschneid@redhat.com>,
"K Prateek Nayak" <kprateek.nayak@amd.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"David Hildenbrand" <david@kernel.org>,
"Lorenzo Stoakes" <ljs@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
"Vlastimil Babka" <vbabka@kernel.org>,
"Mike Rapoport" <rppt@kernel.org>,
"Michal Hocko" <mhocko@suse.com>, "Kees Cook" <kees@kernel.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
"Luka Bai" <lukabai@tencent.com>
Subject: [PATCH 2/6] psi: reorganize the psi members for cacheline benefits
Date: Tue, 12 May 2026 14:19:58 +0800
Message-ID: <20260512-psi_impr-v1-2-2b7f10fdfad5@tencent.com>
In-Reply-To: <20260512-psi_impr-v1-0-2b7f10fdfad5@tencent.com>
From: Luka Bai <lukabai@tencent.com>
Currently, we decide whether a task needs psi accounting by reading
task->pid, which does not share a cacheline with the other psi
variables such as in_memstall. perf record indicates that this can
cause stalls on cacheline misses. So we would like to place these
variables together.
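To illustrate the cacheline argument, here is a minimal userspace sketch (not part of the patch). struct mock_task, its 200-byte filler, and the 64-byte cacheline size are all assumptions made for the example; the real task_struct layout depends on the kernel config and architecture, and is better inspected with a tool such as pahole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines; the real size is architecture-dependent. */
#define MOCK_CACHELINE_BYTES 64

/*
 * Hypothetical stand-in for task_struct. The bitfields are wrapped in a
 * named struct only because offsetof() cannot be applied to a bitfield.
 */
struct mock_task {
	int pid;                 /* what the old !task->pid check reads */
	char unrelated[200];     /* placeholder for the members in between */
	struct {
		unsigned in_memstall:1;
		unsigned need_psi:1;
	} psi_bits;
};

/*
 * True if two byte offsets fall into the same cacheline, assuming the
 * enclosing object starts on a cacheline boundary.
 */
static inline bool same_cacheline(size_t a, size_t b)
{
	return a / MOCK_CACHELINE_BYTES == b / MOCK_CACHELINE_BYTES;
}

/* In this mock layout, pid and the psi bits sit on different lines. */
static inline void check_mock_layout(void)
{
	assert(!same_cacheline(offsetof(struct mock_task, pid),
			       offsetof(struct mock_task, psi_bits)));
}
```

In this mock, a check that reads pid and then in_memstall touches two lines, while a check that reads need_psi and in_memstall touches one.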
However, directly swapping the order of pid and restart_block may cause
cacheline problems in other scenarios that are hard to identify.
So we add a need_psi bitfield that carries the same information and put
it next to in_memstall. The value of need_psi never changes after the
task is created, so it needs no synchronization. Also, adding one bit
to an existing unsigned int bitfield does not enlarge task_struct or
change its memory layout at all.
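The set-once idea can be sketched in userspace as follows. This is only a simplified model under stated assumptions: struct mock_psi_task, mock_task_init() and mock_psi_should_account() are hypothetical names, and the is_idle parameter stands in for the pid != &init_struct_pid test that the patch uses in copy_process().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced model of a task. */
struct mock_psi_task {
	int pid;
	unsigned in_memstall:1;
	unsigned need_psi:1;    /* written once at creation, read-only after */
};

/* Models the fork-time initialisation: idle tasks opt out of accounting. */
static inline void mock_task_init(struct mock_psi_task *t, int pid,
				  bool is_idle)
{
	t->pid = pid;
	t->in_memstall = 0;
	t->need_psi = !is_idle;
}

/* Models the hot-path guard; true means accounting would proceed. */
static inline bool mock_psi_should_account(const struct mock_psi_task *t)
{
	return t->need_psi;
}

/* Usage example: an idle task is skipped, a regular task is accounted. */
static inline void example_usage(void)
{
	struct mock_psi_task idle, worker;

	mock_task_init(&idle, 0, true);
	mock_task_init(&worker, 42, false);
	assert(!mock_psi_should_account(&idle));
	assert(mock_psi_should_account(&worker));
}
```

Because the flag is written exactly once before the task becomes visible to the scheduler, the reader in this model never observes an intermediate value.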
Also, we move psi_flags, which is only 5 bits wide, next to
in_memstall and need_psi so that they all share a cacheline.
The 5 extra bits also fit into the same unsigned int, so this does not
enlarge task_struct either. On the contrary, it removes the separate
unsigned int psi_flags member, which can shrink task_struct.
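The size argument can be checked with a small userspace sketch. struct mock_before and struct mock_after are hypothetical and contain only the members discussed here, and sched_flag is a made-up placeholder for the bits that already live in the same word. Bitfield layout is implementation-defined, so the result is only what GCC and Clang produce on the usual Linux ABIs.

```c
#include <assert.h>

/* Hypothetical "before" layout: a flag word plus a separate psi_flags. */
struct mock_before {
	unsigned sched_flag:1;      /* stands in for existing neighbouring bits */
	unsigned in_memstall:1;
	unsigned int psi_flags;     /* standalone member, removed by the patch */
};

/* Hypothetical "after" layout: need_psi and psi_flags join the flag word. */
struct mock_after {
	unsigned sched_flag:1;
	unsigned in_memstall:1;
	unsigned need_psi:1;
	unsigned psi_flags:5;       /* 5 == NR_PSI_ALL_COUNTS in the patch */
};

/* Packing the extra bits must not grow the struct. */
_Static_assert(sizeof(struct mock_after) <= sizeof(struct mock_before),
	       "packing must not grow the struct");

/* Check that a 5-bit field holds every psi state bit without spilling. */
static inline void check_psi_flags_roundtrip(void)
{
	struct mock_after t = { 0 };

	t.psi_flags = 0x1f;         /* all five state bits set */
	assert(t.psi_flags == 0x1f);
	assert(t.in_memstall == 0 && t.need_psi == 0);
}
```

In the real task_struct the saving depends on the surrounding padding, so it is worth confirming with pahole on an actual build.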
We also add NR_TSK_ONCPU and NR_PSI_ALL_COUNTS to the psi_task_count
enum to make the semantics clearer, and move the enum definition
from linux/psi_types.h to linux/sched.h since task_struct in
linux/sched.h now needs it. These two changes make no functional
difference to the code.
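That the enum change is value-preserving can be verified with a short standalone check. The enum below is copied from the patch with comments dropped; TSK_ONCPU_OLD, TSK_ONCPU_NEW, psi_all_flags_mask() and check_psi_enum() are hypothetical names introduced only for this comparison.

```c
#include <assert.h>

/* Enum as it reads after this patch (comments dropped for brevity). */
enum psi_task_count {
	NR_IOWAIT,
	NR_MEMSTALL,
	NR_RUNNING,
	NR_MEMSTALL_RUNNING,
	NR_PSI_TASK_COUNTS,
	NR_TSK_ONCPU = NR_PSI_TASK_COUNTS,
	NR_PSI_ALL_COUNTS,
};

/* Hypothetical names, used here only to compare both definitions. */
#define TSK_ONCPU_OLD (1 << 4)              /* NR_PSI_TASK_COUNTS was "= 4" */
#define TSK_ONCPU_NEW (1 << NR_TSK_ONCPU)

/* The implicit enumerator value matches the old explicit "= 4". */
_Static_assert(NR_PSI_TASK_COUNTS == 4, "task count must stay 4");
_Static_assert(TSK_ONCPU_OLD == TSK_ONCPU_NEW, "TSK_ONCPU must not change");

/* Mask covering every bit psi_flags may hold, i.e. the bitfield width. */
static inline unsigned int psi_all_flags_mask(void)
{
	return (1u << NR_PSI_ALL_COUNTS) - 1;
}

static inline void check_psi_enum(void)
{
	assert(NR_PSI_ALL_COUNTS == 5);
	assert(psi_all_flags_mask() == 0x1f);
}
```

NR_PSI_ALL_COUNTS evaluates to 5, which is the bitfield width used for psi_flags in task_struct.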
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
include/linux/psi_types.h | 20 +-------------------
include/linux/sched.h | 29 +++++++++++++++++++++++++----
kernel/fork.c | 10 ++++++++++
kernel/sched/psi.c | 6 +++---
4 files changed, 39 insertions(+), 26 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index dd10c22299ab..5639dcdd90af 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -10,24 +10,6 @@
#ifdef CONFIG_PSI
-/* Tracked task states */
-enum psi_task_count {
- NR_IOWAIT,
- NR_MEMSTALL,
- NR_RUNNING,
- /*
- * For IO and CPU stalls the presence of running/oncpu tasks
- * in the domain means a partial rather than a full stall.
- * For memory it's not so simple because of page reclaimers:
- * they are running/oncpu while representing a stall. To tell
- * whether a domain has productivity left or not, we need to
- * distinguish between regular running (i.e. productive)
- * threads and memstall ones.
- */
- NR_MEMSTALL_RUNNING,
- NR_PSI_TASK_COUNTS = 4,
-};
-
/* Task state bitmasks */
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
@@ -35,7 +17,7 @@ enum psi_task_count {
#define TSK_MEMSTALL_RUNNING (1 << NR_MEMSTALL_RUNNING)
/* Only one task can be scheduled, no corresponding task count */
-#define TSK_ONCPU (1 << NR_PSI_TASK_COUNTS)
+#define TSK_ONCPU (1 << NR_TSK_ONCPU)
/* Resources that workloads could be stalled on */
enum psi_res {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..34d7f80531e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,6 +817,28 @@ struct kmap_ctrl {
#endif
};
+#ifdef CONFIG_PSI
+/* Tracked task states */
+enum psi_task_count {
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_RUNNING,
+ /*
+ * For IO and CPU stalls the presence of running/oncpu tasks
+ * in the domain means a partial rather than a full stall.
+ * For memory it's not so simple because of page reclaimers:
+ * they are running/oncpu while representing a stall. To tell
+ * whether a domain has productivity left or not, we need to
+ * distinguish between regular running (i.e. productive)
+ * threads and memstall ones.
+ */
+ NR_MEMSTALL_RUNNING,
+ NR_PSI_TASK_COUNTS,
+ NR_TSK_ONCPU = NR_PSI_TASK_COUNTS,
+ NR_PSI_ALL_COUNTS,
+};
+#endif
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1030,6 +1052,9 @@ struct task_struct {
#ifdef CONFIG_PSI
/* Stalled due to lack of memory */
unsigned in_memstall:1;
+ unsigned need_psi:1;
+ /* Pressure stall state */
+ unsigned psi_flags:NR_PSI_ALL_COUNTS;
#endif
#ifdef CONFIG_PAGE_OWNER
/* Used by page_owner=on to detect recursion in page tracking. */
@@ -1299,10 +1324,6 @@ struct task_struct {
kernel_siginfo_t *last_siginfo;
struct task_io_accounting ioac;
-#ifdef CONFIG_PSI
- /* Pressure stall state */
- unsigned int psi_flags;
-#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d97fd71d7f6..20b47c876b27 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2177,6 +2177,16 @@ __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_PSI
p->psi_flags = 0;
+ /*
+ * Only set need_psi to 1 for non-idle tasks. We also
+ * need to reset need_psi of idle tasks to 0, since the
+ * value is copied from the init task, whose need_psi
+ * is not 0.
+ */
+ if (pid != &init_struct_pid)
+ p->need_psi = 1;
+ else
+ p->need_psi = 0;
#endif
task_io_accounting_init(&p->ioac);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 27097cb0dc79..7374c05a5751 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -912,7 +912,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
u64 now;
bool curr_in_memstall;
- if (!task->pid)
+ if (!task->need_psi)
return;
psi_flags_change(task, clear, set);
@@ -937,7 +937,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
psi_write_begin(cpu);
now = cpu_clock(cpu);
- if (next->pid) {
+ if (next->need_psi) {
curr_in_memstall = next->in_memstall;
psi_flags_change(next, 0, TSK_ONCPU);
/*
@@ -957,7 +957,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}
}
- if (prev->pid) {
+ if (prev->need_psi) {
int clear = TSK_ONCPU, set = 0;
bool wake_clock = true;
--
2.52.0