public inbox for linux-kernel@vger.kernel.org
* [PATCHSET v4 0/4] Split iowait into two states
@ 2024-04-16 12:11 Jens Axboe
  2024-04-16 12:11 ` [PATCH 1/4] sched/core: add helpers for iowait handling Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Jens Axboe @ 2024-04-16 12:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx

Hi,

This is v4 of the patchset where the current in_iowait state is split
into two parts:

1) The "task is sleeping waiting on IO", and would like cpufreq goodness
   in terms of sleep and wakeup latencies.
2) The above, and also accounted as such in the iowait stats.

The current ->in_iowait covers both; this series splits it into two types
of state so that each can be controlled separately.

Patches 1..3 are prep patches, changing the type of
task_struct->nr_iowait and adding helpers to manipulate the iowait counts.

Patch 4 does the actual splitting.

Comments welcome! Peter, CC'ing you since I did on the previous posting,
feel free to ignore.

Since v3:
- Move to an atomic_long_t and drop the locking required by encoding
  both the iowait and iowait_acct state in the same variable. Suggested
  by Thomas.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] sched/core: add helpers for iowait handling
  2024-04-16 12:11 [PATCHSET v4 0/4] Split iowait into two states Jens Axboe
@ 2024-04-16 12:11 ` Jens Axboe
  2024-04-16 12:11 ` [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2024-04-16 12:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

Add helpers to inc/dec the runqueue iowait count based on the task, and
use them in the spots where the count is manipulated.

Add an rq_iowait() helper to abstract out how the per-rq stats are read.

No functional changes in this patch, just in preparation for switching
the type of 'nr_iowait'.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c    | 23 +++++++++++++++++++----
 kernel/sched/cputime.c |  3 +--
 kernel/sched/sched.h   |  2 ++
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7019a40457a6..977bb08a33d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3721,6 +3721,21 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
 
 #endif /* !CONFIG_SMP */
 
+static void task_iowait_inc(struct task_struct *p)
+{
+	atomic_inc(&task_rq(p)->nr_iowait);
+}
+
+static void task_iowait_dec(struct task_struct *p)
+{
+	atomic_dec(&task_rq(p)->nr_iowait);
+}
+
+int rq_iowait(struct rq *rq)
+{
+	return atomic_read(&rq->nr_iowait);
+}
+
 static void
 ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 {
@@ -3787,7 +3802,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 	if (p->in_iowait) {
 		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
+		task_iowait_dec(p);
 	}
 
 	activate_task(rq, p, en_flags);
@@ -4364,7 +4379,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		if (task_cpu(p) != cpu) {
 			if (p->in_iowait) {
 				delayacct_blkio_end(p);
-				atomic_dec(&task_rq(p)->nr_iowait);
+				task_iowait_dec(p);
 			}
 
 			wake_flags |= WF_MIGRATED;
@@ -5472,7 +5487,7 @@ unsigned long long nr_context_switches(void)
 
 unsigned int nr_iowait_cpu(int cpu)
 {
-	return atomic_read(&cpu_rq(cpu)->nr_iowait);
+	return rq_iowait(cpu_rq(cpu));
 }
 
 /*
@@ -6692,7 +6707,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
+				task_iowait_inc(prev);
 				delayacct_blkio_start();
 			}
 		}
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..7d9423df7779 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -222,9 +222,8 @@ void account_steal_time(u64 cputime)
 void account_idle_time(u64 cputime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
-	struct rq *rq = this_rq();
 
-	if (atomic_read(&rq->nr_iowait) > 0)
+	if (rq_iowait(this_rq()) > 0)
 		cpustat[CPUTIME_IOWAIT] += cputime;
 	else
 		cpustat[CPUTIME_IDLE] += cputime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..387f67ddf18a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3473,4 +3473,6 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
+int rq_iowait(struct rq *rq);
+
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit
  2024-04-16 12:11 [PATCHSET v4 0/4] Split iowait into two states Jens Axboe
  2024-04-16 12:11 ` [PATCH 1/4] sched/core: add helpers for iowait handling Jens Axboe
@ 2024-04-16 12:11 ` Jens Axboe
  2024-04-24  9:56   ` Peter Zijlstra
  2024-04-16 12:11 ` [PATCH 3/4] sched/core: have io_schedule_prepare() return a long Jens Axboe
  2024-04-16 12:11 ` [PATCH 4/4] sched/core: split iowait state into two states Jens Axboe
  3 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2024-04-16 12:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

In preparation for storing two separate iowait states in rq->nr_iowait,
bump it from a 32-bit to a 64-bit type on 64-bit kernels.

Note that on 32-bit, the number of tasks is limited to 0x8000, which
fits just fine in even half of the existing 32-bit atomic_t. On 64-bit,
no such limit exists, hence play it safe and make it a 64-bit atomic.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c  | 14 +++++++++++++-
 kernel/sched/sched.h |  4 ++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 977bb08a33d2..6a6c985220b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3723,17 +3723,29 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
 
 static void task_iowait_inc(struct task_struct *p)
 {
+#ifdef CONFIG_64BIT
+	atomic_long_inc(&task_rq(p)->nr_iowait);
+#else
 	atomic_inc(&task_rq(p)->nr_iowait);
+#endif
 }
 
 static void task_iowait_dec(struct task_struct *p)
 {
+#ifdef CONFIG_64BIT
+	atomic_long_dec(&task_rq(p)->nr_iowait);
+#else
 	atomic_dec(&task_rq(p)->nr_iowait);
+#endif
 }
 
 int rq_iowait(struct rq *rq)
 {
+#ifdef CONFIG_64BIT
+	return atomic_long_read(&rq->nr_iowait);
+#else
 	return atomic_read(&rq->nr_iowait);
+#endif
 }
 
 static void
@@ -10065,7 +10077,7 @@ void __init sched_init(void)
 #endif
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
-		atomic_set(&rq->nr_iowait, 0);
+		atomic_long_set(&rq->nr_iowait, 0);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 387f67ddf18a..c2802d066615 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1049,7 +1049,11 @@ struct rq {
 	u64			clock_idle_copy;
 #endif
 
+#ifdef CONFIG_64BIT
+	atomic_long_t		nr_iowait;
+#else
 	atomic_t		nr_iowait;
+#endif
 
 #ifdef CONFIG_SCHED_DEBUG
 	u64 last_seen_need_resched_ns;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/4] sched/core: have io_schedule_prepare() return a long
  2024-04-16 12:11 [PATCHSET v4 0/4] Split iowait into two states Jens Axboe
  2024-04-16 12:11 ` [PATCH 1/4] sched/core: add helpers for iowait handling Jens Axboe
  2024-04-16 12:11 ` [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit Jens Axboe
@ 2024-04-16 12:11 ` Jens Axboe
  2024-04-16 12:11 ` [PATCH 4/4] sched/core: split iowait state into two states Jens Axboe
  3 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2024-04-16 12:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

In preparation for needing more state than 32 bits on 64-bit archs,
switch the token to a long instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c           | 2 +-
 include/linux/sched.h        | 4 ++--
 kernel/locking/mutex.c       | 4 ++--
 kernel/locking/rtmutex_api.c | 4 ++--
 kernel/sched/core.c          | 6 +++---
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bdbb557feb5a..77faceddd5dd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1849,7 +1849,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
 	u64 now = blk_time_get_ns();
 	u64 exp;
 	u64 delay_nsec = 0;
-	int tok;
+	long tok;
 
 	while (blkg->parent) {
 		int use_delay = atomic_read(&blkg->use_delay);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c2abbc587b4..dcfc2830ed8e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -317,8 +317,8 @@ asmlinkage void preempt_schedule_irq(void);
  extern void schedule_rtlock(void);
 #endif
 
-extern int __must_check io_schedule_prepare(void);
-extern void io_schedule_finish(int token);
+extern long __must_check io_schedule_prepare(void);
+extern void io_schedule_finish(long token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..4a86ea6c7f19 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -830,7 +830,7 @@ EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
 void __sched
 mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
 {
-	int token;
+	long token;
 
 	might_sleep();
 
@@ -1026,7 +1026,7 @@ EXPORT_SYMBOL(mutex_lock_killable);
  */
 void __sched mutex_lock_io(struct mutex *lock)
 {
-	int token;
+	long token;
 
 	token = io_schedule_prepare();
 	mutex_lock(lock);
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index a6974d044593..ddf7f7f3f0b5 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -547,7 +547,7 @@ EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);
 
 void __sched mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
 {
-	int token;
+	long token;
 
 	might_sleep();
 
@@ -579,7 +579,7 @@ EXPORT_SYMBOL(mutex_lock_killable);
 
 void __sched mutex_lock_io(struct mutex *lock)
 {
-	int token = io_schedule_prepare();
+	long token = io_schedule_prepare();
 
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
 	io_schedule_finish(token);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a6c985220b1..63f6d44f460c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9032,16 +9032,16 @@ int __sched yield_to(struct task_struct *p, bool preempt)
 }
 EXPORT_SYMBOL_GPL(yield_to);
 
-int io_schedule_prepare(void)
+long io_schedule_prepare(void)
 {
-	int old_iowait = current->in_iowait;
+	long old_iowait = current->in_iowait;
 
 	current->in_iowait = 1;
 	blk_flush_plug(current->plug, true);
 	return old_iowait;
 }
 
-void io_schedule_finish(int token)
+void io_schedule_finish(long token)
 {
 	current->in_iowait = token;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-16 12:11 [PATCHSET v4 0/4] Split iowait into two states Jens Axboe
                   ` (2 preceding siblings ...)
  2024-04-16 12:11 ` [PATCH 3/4] sched/core: have io_schedule_prepare() return a long Jens Axboe
@ 2024-04-16 12:11 ` Jens Axboe
  2024-04-16 14:10   ` Christian Loehle
  2024-04-24 10:01   ` Peter Zijlstra
  3 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2024-04-16 12:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

iowait is a bogus metric, but it's helpful in the sense that it allows
short waits to not enter sleep states that have a higher exit latency
than would've otherwise have been picked for iowait'ing tasks. However,
it's harmless in that lots of applications and monitoring assumes that
iowait is busy time, or otherwise use it as a health metric.
Particularly for async IO it's entirely nonsensical.

Split the iowait part into two parts - one that tracks whether the task
needs boosting for short waits, and one that marks whether the task
needs to be accounted as such. ->in_iowait_acct nests inside of
->in_iowait, both for efficiency reasons, but also so that the
relationship between the two is clear. A waiter may set ->in_wait alone
and not care about the accounting.

Existing users of nr_iowait() for accounting purposes are switched to
use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
only cares about iowaiters, not the accounting side.

Utilize that there's enough space in rq->nr_iowait to store both values
in there, shifting the accounting side by half the size of the type.
Thank you to Thomas Gleixner for that [1] suggestion.

[1] https://lore.kernel.org/lkml/87sf1b6o9w.ffs@tglx/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/s390/appldata/appldata_base.c |  2 +-
 arch/s390/appldata/appldata_os.c   |  2 +-
 fs/proc/stat.c                     |  2 +-
 include/linux/sched.h              |  6 +++
 include/linux/sched/stat.h         |  5 ++-
 kernel/sched/core.c                | 62 +++++++++++++++++++++++-------
 kernel/sched/sched.h               |  1 +
 kernel/time/tick-sched.c           |  6 +--
 8 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
index c2978cb03b36..6844b5294a8b 100644
--- a/arch/s390/appldata/appldata_base.c
+++ b/arch/s390/appldata/appldata_base.c
@@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo);
 #endif
 EXPORT_SYMBOL_GPL(nr_threads);
 EXPORT_SYMBOL_GPL(nr_running);
-EXPORT_SYMBOL_GPL(nr_iowait);
+EXPORT_SYMBOL_GPL(nr_iowait_acct);
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index a363d30ce739..fa4b278aca6c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)
 
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
-	os_data->nr_iowait  = nr_iowait();
+	os_data->nr_iowait  = nr_iowait_acct();
 	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
 	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
 	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..149be7a884fb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
 		(unsigned long long)boottime.tv_sec,
 		total_forks,
 		nr_running(),
-		nr_iowait());
+		nr_iowait_acct());
 
 	seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dcfc2830ed8e..26c69db5484b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -924,7 +924,13 @@ struct task_struct {
 
 	/* Bit to tell TOMOYO we're in execve(): */
 	unsigned			in_execve:1;
+	/* task is in iowait */
 	unsigned			in_iowait:1;
+	/*
+	 * task is in iowait and should be accounted as such. can only be set
+	 * if ->in_iowait is also set.
+	 */
+	unsigned			in_iowait_acct:1;
 #ifndef TIF_RESTORE_SIGMASK
 	unsigned			restore_sigmask:1;
 #endif
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..31e8a44b3d71 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -19,8 +19,9 @@ DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned int nr_running(void);
 extern bool single_task_running(void);
-extern unsigned int nr_iowait(void);
-extern unsigned int nr_iowait_cpu(int cpu);
+unsigned int nr_iowait_acct(void);
+unsigned int nr_iowait_acct_cpu(int cpu);
+unsigned int nr_iowait_cpu(int cpu);
 
 static inline int sched_info_on(void)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63f6d44f460c..d52d3118dd73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3721,30 +3721,47 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
 
 #endif /* !CONFIG_SMP */
 
+/*
+ * Store iowait and iowait_acct state in the same variable. The lower bits
+ * hold the iowait state, and the upper bits hold the iowait_acct state.
+ */
 static void task_iowait_inc(struct task_struct *p)
 {
 #ifdef CONFIG_64BIT
-	atomic_long_inc(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << 32);
+	atomic_long_add(val, &task_rq(p)->nr_iowait);
 #else
-	atomic_inc(&task_rq(p)->nr_iowait);
+	int val = 1 + ((int) p->in_iowait_acct << 16);
+	atomic_add(val, &task_rq(p)->nr_iowait);
 #endif
 }
 
 static void task_iowait_dec(struct task_struct *p)
 {
 #ifdef CONFIG_64BIT
-	atomic_long_dec(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << 32);
+	atomic_long_sub(val, &task_rq(p)->nr_iowait);
 #else
-	atomic_dec(&task_rq(p)->nr_iowait);
+	int val = 1 + ((int) p->in_iowait_acct << 16);
+	atomic_sub(val, &task_rq(p)->nr_iowait);
 #endif
 }
 
 int rq_iowait(struct rq *rq)
 {
 #ifdef CONFIG_64BIT
-	return atomic_long_read(&rq->nr_iowait);
+	return atomic_long_read(&rq->nr_iowait) & ((1UL << 32) - 1);
+#else
+	return atomic_read(&rq->nr_iowait) & ((1U << 16) - 1);
+#endif
+}
+
+int rq_iowait_acct(struct rq *rq)
+{
+#ifdef CONFIG_64BIT
+	return atomic_long_read(&rq->nr_iowait) >> 32;
 #else
-	return atomic_read(&rq->nr_iowait);
+	return atomic_read(&rq->nr_iowait) >> 16;
 #endif
 }
 
@@ -5497,7 +5514,12 @@ unsigned long long nr_context_switches(void)
  * it does become runnable.
  */
 
-unsigned int nr_iowait_cpu(int cpu)
+unsigned int nr_iowait_acct_cpu(int cpu)
+{
+	return rq_iowait_acct(cpu_rq(cpu));
+}
+
+unsigned nr_iowait_cpu(int cpu)
 {
 	return rq_iowait(cpu_rq(cpu));
 }
@@ -5532,12 +5554,12 @@ unsigned int nr_iowait_cpu(int cpu)
  * Task CPU affinities can make all that even more 'interesting'.
  */
 
-unsigned int nr_iowait(void)
+unsigned int nr_iowait_acct(void)
 {
 	unsigned int i, sum = 0;
 
 	for_each_possible_cpu(i)
-		sum += nr_iowait_cpu(i);
+		sum += nr_iowait_acct_cpu(i);
 
 	return sum;
 }
@@ -9032,18 +9054,32 @@ int __sched yield_to(struct task_struct *p, bool preempt)
 }
 EXPORT_SYMBOL_GPL(yield_to);
 
+/*
+ * Returns a token comprised of the two bits of iowait state - one is
+ * whether we're marking ourselves as in iowait for cpufreq reasons, and
+ * the other is whether the task should be accounted as such.
+ */
 long io_schedule_prepare(void)
 {
-	long old_iowait = current->in_iowait;
-
+#ifdef CONFIG_64BIT
+	long token = current->in_iowait + ((long) current->in_iowait_acct << 32);
+#else
+	int token = current->in_iowait + (current->in_iowait_acct << 16);
+#endif
 	current->in_iowait = 1;
+	current->in_iowait_acct = 1;
 	blk_flush_plug(current->plug, true);
-	return old_iowait;
+	return token;
 }
 
 void io_schedule_finish(long token)
 {
-	current->in_iowait = token;
+	current->in_iowait = token & 0x01;
+#ifdef CONFIG_64BIT
+	current->in_iowait_acct = token >> 32;
+#else
+	current->in_iowait_acct = token >> 16;
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2802d066615..85fdd6028682 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3478,5 +3478,6 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 int rq_iowait(struct rq *rq);
+int rq_iowait_acct(struct rq *rq);
 
 #endif /* _KERNEL_SCHED_SCHED_H */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 269e21590df5..52f377c5871d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -728,7 +728,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	delta = ktime_sub(now, ts->idle_entrytime);
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	if (nr_iowait_acct_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -801,7 +801,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
-				     !nr_iowait_cpu(cpu), last_update_time);
+				     !nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -827,7 +827,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
-				     nr_iowait_cpu(cpu), last_update_time);
+				     nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-16 12:11 ` [PATCH 4/4] sched/core: split iowait state into two states Jens Axboe
@ 2024-04-16 14:10   ` Christian Loehle
  2024-04-16 14:25     ` Christian Loehle
  2024-04-24 10:01   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-04-16 14:10 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: peterz, tglx

On 16/04/2024 13:11, Jens Axboe wrote:
> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than would've otherwise have been picked for iowait'ing tasks. However,
> it's harmless in that lots of applications and monitoring assumes that
s/harmless/harmful ?
> iowait is busy time, or otherwise use it as a health metric.
> Particularly for async IO it's entirely nonsensical.
> 
> Split the iowait part into two parts - one that tracks whether the task
> needs boosting for short waits, and one that marks whether the task
> needs to be accounted as such. ->in_iowait_acct nests inside of
> ->in_iowait, both for efficiency reasons, but also so that the
> relationship between the two is clear. A waiter may set ->in_wait alone
s/in_wait/in_iowait/
> and not care about the accounting.
> 
> Existing users of nr_iowait() for accounting purposes are switched to
> use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
> only cares about iowaiters, not the accounting side.
> 
> Utilize that there's enough space in rq->nr_iowait to store both values
> in there, shifting the accounting side by half the size of the type.
> Thank you to Thomas Gleixner for that [1] suggestion.
> 
> [1] https://lore.kernel.org/lkml/87sf1b6o9w.ffs@tglx/
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Again this is just a tiny symptom of iowait being bogus, but I understand
your problem and have nothing simple to offer you as of now.
Daniel's comment mentioned below is still spot on.
Any behavior piggybacked on top of iowait is also nonsensical without
any definition of what "iowait" is actually supposed to mean.
Are we returning to the same rq real soon? Maybe, maybe not.
Are we on the critical path to higher IO throughput and should be run
with a high frequency after wakeup? Maybe, maybe not (Probably not?).
Apart from that obligatory nagging and my comments below this looks okay.

> ---
>  arch/s390/appldata/appldata_base.c |  2 +-
>  arch/s390/appldata/appldata_os.c   |  2 +-
>  fs/proc/stat.c                     |  2 +-
>  include/linux/sched.h              |  6 +++
>  include/linux/sched/stat.h         |  5 ++-
>  kernel/sched/core.c                | 62 +++++++++++++++++++++++-------
>  kernel/sched/sched.h               |  1 +
>  kernel/time/tick-sched.c           |  6 +--
>  8 files changed, 65 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
> index c2978cb03b36..6844b5294a8b 100644
> --- a/arch/s390/appldata/appldata_base.c
> +++ b/arch/s390/appldata/appldata_base.c
> @@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo);
>  #endif
>  EXPORT_SYMBOL_GPL(nr_threads);
>  EXPORT_SYMBOL_GPL(nr_running);
> -EXPORT_SYMBOL_GPL(nr_iowait);
> +EXPORT_SYMBOL_GPL(nr_iowait_acct);
> diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
> index a363d30ce739..fa4b278aca6c 100644
> --- a/arch/s390/appldata/appldata_os.c
> +++ b/arch/s390/appldata/appldata_os.c
> @@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)
>  
>  	os_data->nr_threads = nr_threads;
>  	os_data->nr_running = nr_running();
> -	os_data->nr_iowait  = nr_iowait();
> +	os_data->nr_iowait  = nr_iowait_acct();
>  	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
>  	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
>  	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index da60956b2915..149be7a884fb 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
>  		(unsigned long long)boottime.tv_sec,
>  		total_forks,
>  		nr_running(),
> -		nr_iowait());
> +		nr_iowait_acct());
>  
>  	seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dcfc2830ed8e..26c69db5484b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -924,7 +924,13 @@ struct task_struct {
>  
>  	/* Bit to tell TOMOYO we're in execve(): */
>  	unsigned			in_execve:1;
> +	/* task is in iowait */
>  	unsigned			in_iowait:1;
> +	/*
> +	 * task is in iowait and should be accounted as such. can only be set
> +	 * if ->in_iowait is also set.
> +	 */
> +	unsigned			in_iowait_acct:1;
>  #ifndef TIF_RESTORE_SIGMASK
>  	unsigned			restore_sigmask:1;
>  #endif
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 0108a38bb64d..31e8a44b3d71 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -19,8 +19,9 @@ DECLARE_PER_CPU(unsigned long, process_counts);
>  extern int nr_processes(void);
>  extern unsigned int nr_running(void);
>  extern bool single_task_running(void);
> -extern unsigned int nr_iowait(void);
> -extern unsigned int nr_iowait_cpu(int cpu);
> +unsigned int nr_iowait_acct(void);
> +unsigned int nr_iowait_acct_cpu(int cpu);
> +unsigned int nr_iowait_cpu(int cpu);
>  
>  static inline int sched_info_on(void)
>  {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 63f6d44f460c..d52d3118dd73 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3721,30 +3721,47 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
>  
>  #endif /* !CONFIG_SMP */
>  
> +/*
> + * Store iowait and iowait_acct state in the same variable. The lower bits
> + * hold the iowait state, and the upper bits hold the iowait_acct state.
> + */
>  static void task_iowait_inc(struct task_struct *p)
>  {
>  #ifdef CONFIG_64BIT
> -	atomic_long_inc(&task_rq(p)->nr_iowait);
> +	long val = 1 + ((long) p->in_iowait_acct << 32);
Are we not doing declarations at the beginning of the block anymore?

> +	atomic_long_add(val, &task_rq(p)->nr_iowait);
>  #else
> -	atomic_inc(&task_rq(p)->nr_iowait);
> +	int val = 1 + ((int) p->in_iowait_acct << 16);
> +	atomic_add(val, &task_rq(p)->nr_iowait);
>  #endif
>  }
>  
>  static void task_iowait_dec(struct task_struct *p)
>  {
>  #ifdef CONFIG_64BIT
> -	atomic_long_dec(&task_rq(p)->nr_iowait);
> +	long val = 1 + ((long) p->in_iowait_acct << 32);
> +	atomic_long_sub(val, &task_rq(p)->nr_iowait);
>  #else
> -	atomic_dec(&task_rq(p)->nr_iowait);
> +	int val = 1 + ((int) p->in_iowait_acct << 16);
> +	atomic_sub(val, &task_rq(p)->nr_iowait);
>  #endif
>  }
>  
>  int rq_iowait(struct rq *rq)
>  {
>  #ifdef CONFIG_64BIT
> -	return atomic_long_read(&rq->nr_iowait);
> +	return atomic_long_read(&rq->nr_iowait) & ((1UL << 32) - 1);
> +#else
> +	return atomic_read(&rq->nr_iowait) & ((1U << 16) - 1);
> +#endif
> +}
> +
> +int rq_iowait_acct(struct rq *rq)
> +{
> +#ifdef CONFIG_64BIT
> +	return atomic_long_read(&rq->nr_iowait) >> 32;
>  #else
> -	return atomic_read(&rq->nr_iowait);
> +	return atomic_read(&rq->nr_iowait) >> 16;
>  #endif
>  }
>  
> @@ -5497,7 +5514,12 @@ unsigned long long nr_context_switches(void)
>   * it does become runnable.
>   */
>  
> -unsigned int nr_iowait_cpu(int cpu)
> +unsigned int nr_iowait_acct_cpu(int cpu)

There is a comment above here by Daniel referring to "these two
interfaces [...]", originally referring to nr_iowait_cpu() and
get_iowait_load().
get_iowait_load() was removed in commit a7fe5190c03f ("cpuidle: menu: Remove get_loadavg() from the performance multiplier")
but nr_iowait_cpu() remains, so the comment should remain above it.
Rewording is also way overdue, although that's hardly on your
patch ;)

> +{
> +	return rq_iowait_acct(cpu_rq(cpu));
> +}
> +
> +unsigned nr_iowait_cpu(int cpu)
unsigned int
>  {
>  	return rq_iowait(cpu_rq(cpu));
>  }
> @@ -5532,12 +5554,12 @@ unsigned int nr_iowait_cpu(int cpu)
>   * Task CPU affinities can make all that even more 'interesting'.
>   */
>  
> -unsigned int nr_iowait(void)
> +unsigned int nr_iowait_acct(void)
>  {
>  	unsigned int i, sum = 0;
>  
>  	for_each_possible_cpu(i)
> -		sum += nr_iowait_cpu(i);
> +		sum += nr_iowait_acct_cpu(i);
>  
>  	return sum;
>  }
> @@ -9032,18 +9054,32 @@ int __sched yield_to(struct task_struct *p, bool preempt)
>  }
>  EXPORT_SYMBOL_GPL(yield_to);
>  [snip]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-16 14:10   ` Christian Loehle
@ 2024-04-16 14:25     ` Christian Loehle
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Loehle @ 2024-04-16 14:25 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel; +Cc: peterz, tglx

On 16/04/2024 15:10, Christian Loehle wrote:
>>[snip]
>> +	atomic_long_add(val, &task_rq(p)->nr_iowait);
>>  #else
>> -	atomic_inc(&task_rq(p)->nr_iowait);
>> +	int val = 1 + ((int) p->in_iowait_acct << 16);
>> +	atomic_add(val, &task_rq(p)->nr_iowait);
>>  #endif
>>  }
>>  
>>  static void task_iowait_dec(struct task_struct *p)
>>  {
>>  #ifdef CONFIG_64BIT
>> -	atomic_long_dec(&task_rq(p)->nr_iowait);
>> +	long val = 1 + ((long) p->in_iowait_acct << 32);
>> +	atomic_long_sub(val, &task_rq(p)->nr_iowait);
>>  #else
>> -	atomic_dec(&task_rq(p)->nr_iowait);
>> +	int val = 1 + ((int) p->in_iowait_acct << 16);
>> +	atomic_sub(val, &task_rq(p)->nr_iowait);
>>  #endif
>>  }
>>  
>>  int rq_iowait(struct rq *rq)
>>  {
>>  #ifdef CONFIG_64BIT
>> -	return atomic_long_read(&rq->nr_iowait);
>> +	return atomic_long_read(&rq->nr_iowait) & ((1UL << 32) - 1);
>> +#else
>> +	return atomic_read(&rq->nr_iowait) & ((1U << 16) - 1);
>> +#endif
>> +}
>> +
>> +int rq_iowait_acct(struct rq *rq)
>> +{
>> +#ifdef CONFIG_64BIT
>> +	return atomic_long_read(&rq->nr_iowait) >> 32;
>>  #else
>> -	return atomic_read(&rq->nr_iowait);
>> +	return atomic_read(&rq->nr_iowait) >> 16;
>>  #endif
>>  }
>>  
>> @@ -5497,7 +5514,12 @@ unsigned long long nr_context_switches(void)
>>   * it does become runnable.
>>   */
>>  
>> -unsigned int nr_iowait_cpu(int cpu)
>> +unsigned int nr_iowait_acct_cpu(int cpu)
> 
> There is a comment above here by Daniel referring to "these two
> interfaces [...]", originally referring to nr_iowait_cpu() and
> get_iowait_load().
> get_iowait_load() was removed in commit a7fe5190c03f ("cpuidle: menu: Remove get_loadavg() from the performance multiplier")
> but nr_iowait_cpu() remains, so the comment should remain above it.
> Rewording is also way overdue, although that's hardly on your
> patch ;)
> 
FWIW, but also feel free to reword yourself.

From ba66fe24c64f5615b9820b0fe24857ecbf18fc5f Mon Sep 17 00:00:00 2001
From: Christian Loehle <christian.loehle@arm.com>
Date: Tue, 16 Apr 2024 15:20:22 +0100
Subject: [PATCH] sched/core: Reword comment about iowait interface

The original comment referenced an interface which was removed in
the meantime by commit a7fe5190c03f ("cpuidle: menu: Remove
get_loadavg() from the performance multiplier").

Furthermore the function has moved so move the comment accordingly.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/core.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d52d3118dd73..bb10b9bc37d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5507,18 +5507,19 @@ unsigned long long nr_context_switches(void)
 	return sum;
 }
 
-/*
- * Consumers of these two interfaces, like for example the cpuidle menu
- * governor, are using nonsensical data. Preferring shallow idle state selection
- * for a CPU that has IO-wait which might not even end up running the task when
- * it does become runnable.
- */
 
 unsigned int nr_iowait_acct_cpu(int cpu)
 {
 	return rq_iowait_acct(cpu_rq(cpu));
 }
 
+/*
+ * Consumers of this interface like the cpuidle menu governor are using
+ * nonsensical data. Preferring shallow idle state selection for a CPU that
+ * has IO-wait which might not even end up being selected for the task again
+ * when it does become runnable.
+ */
+
 unsigned nr_iowait_cpu(int cpu)
 {
 	return rq_iowait(cpu_rq(cpu));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit
  2024-04-16 12:11 ` [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit Jens Axboe
@ 2024-04-24  9:56   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2024-04-24  9:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, tglx

On Tue, Apr 16, 2024 at 06:11:19AM -0600, Jens Axboe wrote:
> In preparation for storing two separate iowait states in there, bump the
> size from a 32-bit to a 64-bit size, for 64-bit kernels.
> 
> Note that on 32-bit, the number of tasks is limited to 0x8000, which
> fits just fine in even half of the existing 32-bit atomic_t. For 64-bit,
> no such limit exists, hence play it safe and make it a 64-bit atomic.

We still have the tid limit, no? Anyway...

> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  kernel/sched/core.c  | 14 +++++++++++++-
>  kernel/sched/sched.h |  4 ++++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 977bb08a33d2..6a6c985220b1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3723,17 +3723,29 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
>  
>  static void task_iowait_inc(struct task_struct *p)
>  {
> +#ifdef CONFIG_64BIT
> +	atomic_long_inc(&task_rq(p)->nr_iowait);
> +#else
>  	atomic_inc(&task_rq(p)->nr_iowait);
> +#endif
>  }
>  
>  static void task_iowait_dec(struct task_struct *p)
>  {
> +#ifdef CONFIG_64BIT
> +	atomic_long_dec(&task_rq(p)->nr_iowait);
> +#else
>  	atomic_dec(&task_rq(p)->nr_iowait);
> +#endif
>  }
>  
>  int rq_iowait(struct rq *rq)
>  {
> +#ifdef CONFIG_64BIT
> +	return atomic_long_read(&rq->nr_iowait);
> +#else
>  	return atomic_read(&rq->nr_iowait);
> +#endif
>  }
>  
>  static void
> @@ -10065,7 +10077,7 @@ void __init sched_init(void)
>  #endif
>  #endif /* CONFIG_SMP */
>  		hrtick_rq_init(rq);
> -		atomic_set(&rq->nr_iowait, 0);
> +		atomic_long_set(&rq->nr_iowait, 0);
>  

This one site lacks the ifdeffery, which seems superfluous anyway, since
long is already 32bit / 64bit like you want. Hmm?

>  #ifdef CONFIG_SCHED_CORE
>  		rq->core = rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 387f67ddf18a..c2802d066615 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1049,7 +1049,11 @@ struct rq {
>  	u64			clock_idle_copy;
>  #endif
>  
> +#ifdef CONFIG_64BIT
> +	atomic_long_t		nr_iowait;
> +#else
>  	atomic_t		nr_iowait;
> +#endif
>  
>  #ifdef CONFIG_SCHED_DEBUG
>  	u64 last_seen_need_resched_ns;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-16 12:11 ` [PATCH 4/4] sched/core: split iowait state into two states Jens Axboe
  2024-04-16 14:10   ` Christian Loehle
@ 2024-04-24 10:01   ` Peter Zijlstra
  2024-04-24 10:08     ` Christian Loehle
  2024-04-25 14:20     ` Rafael J. Wysocki
  1 sibling, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2024-04-24 10:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, tglx, Rafael J. Wysocki, linux-pm, daniel.lezcano

On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than would otherwise have been picked for iowait'ing tasks. However,
> it's harmful in that lots of applications and monitoring assume that
> iowait is busy time, or otherwise use it as a health metric.
> Particularly for async IO it's entirely nonsensical.

Let me get this straight, all of this is about working around
cpuidle menu governor insanity?

Rafael, how far along are we with fully deprecating that thing? Yes it
still exists, but should people really be using it still?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-24 10:01   ` Peter Zijlstra
@ 2024-04-24 10:08     ` Christian Loehle
  2024-04-25 10:16       ` Peter Zijlstra
  2024-04-25 14:20     ` Rafael J. Wysocki
  1 sibling, 1 reply; 16+ messages in thread
From: Christian Loehle @ 2024-04-24 10:08 UTC (permalink / raw)
  To: Peter Zijlstra, Jens Axboe
  Cc: linux-kernel, tglx, Rafael J. Wysocki, linux-pm, daniel.lezcano

On 24/04/2024 11:01, Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
>> iowait is a bogus metric, but it's helpful in the sense that it allows
>> short waits to not enter sleep states that have a higher exit latency
>> than would otherwise have been picked for iowait'ing tasks. However,
>> it's harmful in that lots of applications and monitoring assume that
>> iowait is busy time, or otherwise use it as a health metric.
>> Particularly for async IO it's entirely nonsensical.
> 
> Let me get this straight, all of this is about working around
> cpuidle menu governor insanity?
> 
> Rafael, how far along are we with fully deprecating that thing? Yes it
> still exists, but should people really be using it still?
> 

Well there is also the iowait boost handling in schedutil and intel_pstate
which, at least in synthetic benchmarks, does have an effect [1].
io_uring (the only user of iowait but not iowait_acct) works around both.

See commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")

[1]
https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/#t

Kind Regards,
Christian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-24 10:08     ` Christian Loehle
@ 2024-04-25 10:16       ` Peter Zijlstra
  2024-04-25 10:39         ` Christian Loehle
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2024-04-25 10:16 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Jens Axboe, linux-kernel, tglx, Rafael J. Wysocki, linux-pm,
	daniel.lezcano

On Wed, Apr 24, 2024 at 11:08:42AM +0100, Christian Loehle wrote:
> On 24/04/2024 11:01, Peter Zijlstra wrote:
> > On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
> >> iowait is a bogus metric, but it's helpful in the sense that it allows
> >> short waits to not enter sleep states that have a higher exit latency
> >> than would otherwise have been picked for iowait'ing tasks. However,
> >> it's harmful in that lots of applications and monitoring assume that
> >> iowait is busy time, or otherwise use it as a health metric.
> >> Particularly for async IO it's entirely nonsensical.
> > 
> > Let me get this straight, all of this is about working around
> > cpuidle menu governor insanity?
> > 
> > Rafael, how far along are we with fully deprecating that thing? Yes it
> > still exists, but should people really be using it still?
> > 
> 
> Well there is also the iowait boost handling in schedutil and intel_pstate
> which, at least in synthetic benchmarks, does have an effect [1].

Those are cpufreq not cpuidle and at least they don't use nr_iowait. The
original Changelog mentioned idle states, and I hate on menu for using
nr_iowait.

> io_uring (the only user of iowait but not iowait_acct) works around both.
> 
> See commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
> 
> [1]
> https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/#t

So while I agree with most of the shortcomings listed in that set,
that patch is quite terrifying.

I would prefer to start with something a *lot* simpler. How about a tick
driven decay of iops count per task. And that whole step array
*shudder*.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-25 10:16       ` Peter Zijlstra
@ 2024-04-25 10:39         ` Christian Loehle
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Loehle @ 2024-04-25 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-kernel, tglx, Rafael J. Wysocki, linux-pm,
	daniel.lezcano

On 25/04/2024 11:16, Peter Zijlstra wrote:
> On Wed, Apr 24, 2024 at 11:08:42AM +0100, Christian Loehle wrote:
>> On 24/04/2024 11:01, Peter Zijlstra wrote:
>>> On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
>>>> iowait is a bogus metric, but it's helpful in the sense that it allows
>>>> short waits to not enter sleep states that have a higher exit latency
>>>> than would otherwise have been picked for iowait'ing tasks. However,
>>>> it's harmful in that lots of applications and monitoring assume that
>>>> iowait is busy time, or otherwise use it as a health metric.
>>>> Particularly for async IO it's entirely nonsensical.
>>>
>>> Let me get this straight, all of this is about working around
>>> cpuidle menu governor insanity?
>>>
>>> Rafael, how far along are we with fully deprecating that thing? Yes it
>>> still exists, but should people really be using it still?
>>>
>>
>> Well there is also the iowait boost handling in schedutil and intel_pstate
>> which, at least in synthetic benchmarks, does have an effect [1].
> 
> Those are cpufreq not cpuidle and at least they don't use nr_iowait. The
> original Changelog mentioned idle states, and I hate on menu for using
> nr_iowait.

I'd say they care about any regression, but I'll let Jens answer that.
The original change also mentions cpufreq and Jens did mention in an
earlier version that he doesn't care, for them it's all just increased
latency ;) 
https://lore.kernel.org/lkml/00d36e83-c9a5-412d-bf49-2e109308d6cd@arm.com/T/#m216536520bc31846aff5875993d22f446a37b297

> 
>> io_uring (the only user of iowait but not iowait_acct) works around both.
>>
>> See commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
>>
>> [1]
>> https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/#t
> 
> So while I agree with most of the shortcomings listed in that set,
> that patch is quite terrifying.

Not disagreeing with you on that.

> 
> I would prefer to start with something a *lot* simpler. How about a tick
> driven decay of iops count per task. And that whole step array
> *shudder*.

It's an attempt at solving unnecessary boosting based upon what is there for
us to work with now: iowait wakeups.
There are many workloads with e.g. > 5000 iowait wakeups per second that don't
benefit from boosting at all (and therefore it's a complete energy waste).
I don't see any obvious way to detect non-boost-worthy
scenarios with a tick-driven decay count, but please do elaborate.

(If you *really* care about IO throughput, the task wakeup path is hopefully
not critical anyway (i.e. you do everything in your power to have IO pending
during that time) and then we don't need boosting, but just looking
at a tick-length period doesn't let us distinguish those scenarios AFAICS.)

Regards,
Christian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-04-24 10:01   ` Peter Zijlstra
  2024-04-24 10:08     ` Christian Loehle
@ 2024-04-25 14:20     ` Rafael J. Wysocki
  1 sibling, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2024-04-25 14:20 UTC (permalink / raw)
  To: Jens Axboe, Peter Zijlstra
  Cc: linux-kernel, tglx, linux-pm, daniel.lezcano, Rafael J. Wysocki

On Wednesday, April 24, 2024 12:01:27 PM CEST Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 06:11:21AM -0600, Jens Axboe wrote:
> > iowait is a bogus metric, but it's helpful in the sense that it allows
> > short waits to not enter sleep states that have a higher exit latency
> > than would otherwise have been picked for iowait'ing tasks. However,
> > it's harmful in that lots of applications and monitoring assume that
> > iowait is busy time, or otherwise use it as a health metric.
> > Particularly for async IO it's entirely nonsensical.
> 
> Let me get this straight, all of this is about working around
> cpuidle menu governor insanity?
> 
> Rafael, how far along are we with fully deprecating that thing? Yes it
> still exists, but should people really be using it still?

Well, they appear to be used to it ...




^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 4/4] sched/core: split iowait state into two states
  2024-08-17 20:45 [PATCHSET v5 0/4] Split iowait " Jens Axboe
@ 2024-08-17 20:45 ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2024-08-17 20:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

iowait is a bogus metric, but it's helpful in the sense that it allows
short waits to not enter sleep states that have a higher exit latency
than would otherwise have been picked for iowait'ing tasks. However,
it's harmful in that lots of applications and monitoring assume that
iowait is busy time, or otherwise use it as a health metric.
Particularly for async IO it's entirely nonsensical.

Split the iowait part into two parts - one that tracks whether the task
needs boosting for short waits, and one that marks whether the task
needs to be accounted as such. ->in_iowait_acct nests inside of
->in_iowait, both for efficiency reasons, but also so that the
relationship between the two is clear. A waiter may set ->in_iowait alone
and not care about the accounting.

Existing users of nr_iowait() for accounting purposes are switched to
use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
only cares about iowaiters, not the accounting side.

Utilize that there's enough space in rq->nr_iowait to store both values
in there, shifting the accounting side by half the size of the type.
Thank you to Thomas Gleixner for that [1] suggestion.

[1] https://lore.kernel.org/lkml/87sf1b6o9w.ffs@tglx/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/s390/appldata/appldata_base.c |  2 +-
 arch/s390/appldata/appldata_os.c   |  2 +-
 fs/proc/stat.c                     |  2 +-
 include/linux/sched.h              |  6 +++
 include/linux/sched/stat.h         |  5 ++-
 kernel/sched/core.c                | 62 +++++++++++++++++++++++-------
 kernel/sched/sched.h               |  1 +
 kernel/time/tick-sched.c           |  6 +--
 8 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
index 91a30e017d65..5b21225405fd 100644
--- a/arch/s390/appldata/appldata_base.c
+++ b/arch/s390/appldata/appldata_base.c
@@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo);
 #endif
 EXPORT_SYMBOL_GPL(nr_threads);
 EXPORT_SYMBOL_GPL(nr_running);
-EXPORT_SYMBOL_GPL(nr_iowait);
+EXPORT_SYMBOL_GPL(nr_iowait_acct);
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index a363d30ce739..fa4b278aca6c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)
 
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
-	os_data->nr_iowait  = nr_iowait();
+	os_data->nr_iowait  = nr_iowait_acct();
 	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
 	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
 	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..149be7a884fb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
 		(unsigned long long)boottime.tv_sec,
 		total_forks,
 		nr_running(),
-		nr_iowait());
+		nr_iowait_acct());
 
 	seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a65e19a3ac..1b40847b0a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,7 +953,13 @@ struct task_struct {
 
 	/* Bit to tell TOMOYO we're in execve(): */
 	unsigned			in_execve:1;
+	/* task is in iowait */
 	unsigned			in_iowait:1;
+	/*
+	 * task is in iowait and should be accounted as such. can only be set
+	 * if ->in_iowait is also set.
+	 */
+	unsigned			in_iowait_acct:1;
 #ifndef TIF_RESTORE_SIGMASK
 	unsigned			restore_sigmask:1;
 #endif
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..31e8a44b3d71 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -19,8 +19,9 @@ DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned int nr_running(void);
 extern bool single_task_running(void);
-extern unsigned int nr_iowait(void);
-extern unsigned int nr_iowait_cpu(int cpu);
+unsigned int nr_iowait_acct(void);
+unsigned int nr_iowait_acct_cpu(int cpu);
+unsigned int nr_iowait_cpu(int cpu);
 
 static inline int sched_info_on(void)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7cb7ca38fdfc..94af01c6b85d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3541,30 +3541,47 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 
 #endif /* !CONFIG_SMP */
 
+/*
+ * Store iowait and iowait_acct state in the same variable. The lower bits
+ * hold the iowait state, and the upper bits hold the iowait_acct state.
+ */
 static void task_iowait_inc(struct task_struct *p)
 {
 #ifdef CONFIG_64BIT
-	atomic_long_inc(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << 32);
+	atomic_long_add(val, &task_rq(p)->nr_iowait);
 #else
-	atomic_inc(&task_rq(p)->nr_iowait);
+	int val = 1 + ((int) p->in_iowait_acct << 16);
+	atomic_add(val, &task_rq(p)->nr_iowait);
 #endif
 }
 
 static void task_iowait_dec(struct task_struct *p)
 {
 #ifdef CONFIG_64BIT
-	atomic_long_dec(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << 32);
+	atomic_long_sub(val, &task_rq(p)->nr_iowait);
 #else
-	atomic_dec(&task_rq(p)->nr_iowait);
+	int val = 1 + ((int) p->in_iowait_acct << 16);
+	atomic_sub(val, &task_rq(p)->nr_iowait);
 #endif
 }
 
 int rq_iowait(struct rq *rq)
 {
 #ifdef CONFIG_64BIT
-	return atomic_long_read(&rq->nr_iowait);
+	return atomic_long_read(&rq->nr_iowait) & ((1UL << 32) - 1);
+#else
+	return atomic_read(&rq->nr_iowait) & ((1U << 16) - 1);
+#endif
+}
+
+int rq_iowait_acct(struct rq *rq)
+{
+#ifdef CONFIG_64BIT
+	return atomic_long_read(&rq->nr_iowait) >> 32;
 #else
-	return atomic_read(&rq->nr_iowait);
+	return atomic_read(&rq->nr_iowait) >> 16;
 #endif
 }
 
@@ -5307,7 +5324,12 @@ unsigned long long nr_context_switches(void)
  * it does become runnable.
  */
 
-unsigned int nr_iowait_cpu(int cpu)
+unsigned int nr_iowait_acct_cpu(int cpu)
+{
+	return rq_iowait_acct(cpu_rq(cpu));
+}
+
+unsigned nr_iowait_cpu(int cpu)
 {
 	return rq_iowait(cpu_rq(cpu));
 }
@@ -5342,12 +5364,12 @@ unsigned int nr_iowait_cpu(int cpu)
  * Task CPU affinities can make all that even more 'interesting'.
  */
 
-unsigned int nr_iowait(void)
+unsigned int nr_iowait_acct(void)
 {
 	unsigned int i, sum = 0;
 
 	for_each_possible_cpu(i)
-		sum += nr_iowait_cpu(i);
+		sum += nr_iowait_acct_cpu(i);
 
 	return sum;
 }
@@ -7427,18 +7449,32 @@ static inline void preempt_dynamic_init(void) { }
 
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
+/*
+ * Returns a token which is composed of the two bits of iowait wait state -
+ * one is whether we're marking ourselves as in iowait for cpufreq reasons,
+ * and the other is whether the task should be accounted as such.
+ */
 long io_schedule_prepare(void)
 {
-	long old_iowait = current->in_iowait;
-
+#ifdef CONFIG_64BIT
+	long token = current->in_iowait + ((long) current->in_iowait_acct << 32);
+#else
+	int token = current->in_iowait + (current->in_iowait_acct << 16);
+#endif
 	current->in_iowait = 1;
+	current->in_iowait_acct = 1;
 	blk_flush_plug(current->plug, true);
-	return old_iowait;
+	return token;
 }
 
 void io_schedule_finish(long token)
 {
-	current->in_iowait = token;
+	current->in_iowait = token & 0x01;
+#ifdef CONFIG_64BIT
+	current->in_iowait_acct = token >> 32;
+#else
+	current->in_iowait_acct = token >> 16;
+#endif
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a90c2da1eb3..0f0eb2b26816 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3584,6 +3584,7 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 int rq_iowait(struct rq *rq);
+int rq_iowait_acct(struct rq *rq);
 
 #ifdef CONFIG_RT_MUTEXES
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184c7090..17a82771c3df 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -732,7 +732,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	delta = ktime_sub(now, ts->idle_entrytime);
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	if (nr_iowait_acct_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -805,7 +805,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
-				     !nr_iowait_cpu(cpu), last_update_time);
+				     !nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -831,7 +831,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
-				     nr_iowait_cpu(cpu), last_update_time);
+				     nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/4] sched/core: split iowait state into two states
  2024-08-19 15:39 [PATCHSET v6 0/4] Split iowait " Jens Axboe
@ 2024-08-19 15:39 ` Jens Axboe
  2024-09-05 10:55   ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2024-08-19 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, tglx, Jens Axboe

iowait is a bogus metric, but it's helpful in the sense that it allows
short waits to not enter sleep states that have a higher exit latency
than would otherwise have been picked for iowait'ing tasks. However,
it's harmful in that lots of applications and monitoring assume that
iowait is busy time, or otherwise use it as a health metric.
Particularly for async IO it's entirely nonsensical.

Split the iowait part into two parts - one that tracks whether the task
needs boosting for short waits, and one that marks whether the task
needs to be accounted as such. ->in_iowait_acct nests inside of
->in_iowait, both for efficiency reasons, but also so that the
relationship between the two is clear. A waiter may set ->in_iowait alone
and not care about the accounting.

Existing users of nr_iowait() for accounting purposes are switched to
use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
only cares about iowaiters, not the accounting side.

Utilize that there's enough space in rq->nr_iowait to store both values
in there, shifting the accounting side by half the size of the type.
Thank you to Thomas Gleixner for that [1] suggestion.

[1] https://lore.kernel.org/lkml/87sf1b6o9w.ffs@tglx/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/s390/appldata/appldata_base.c |  2 +-
 arch/s390/appldata/appldata_os.c   |  2 +-
 fs/proc/stat.c                     |  2 +-
 include/linux/sched.h              |  6 ++++
 include/linux/sched/stat.h         |  5 ++--
 kernel/sched/core.c                | 45 +++++++++++++++++++++++-------
 kernel/sched/sched.h               |  1 +
 kernel/time/tick-sched.c           |  6 ++--
 8 files changed, 51 insertions(+), 18 deletions(-)

diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
index 91a30e017d65..5b21225405fd 100644
--- a/arch/s390/appldata/appldata_base.c
+++ b/arch/s390/appldata/appldata_base.c
@@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo);
 #endif
 EXPORT_SYMBOL_GPL(nr_threads);
 EXPORT_SYMBOL_GPL(nr_running);
-EXPORT_SYMBOL_GPL(nr_iowait);
+EXPORT_SYMBOL_GPL(nr_iowait_acct);
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index a363d30ce739..fa4b278aca6c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)
 
 	os_data->nr_threads = nr_threads;
 	os_data->nr_running = nr_running();
-	os_data->nr_iowait  = nr_iowait();
+	os_data->nr_iowait  = nr_iowait_acct();
 	os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
 	os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
 	os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..149be7a884fb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
 		(unsigned long long)boottime.tv_sec,
 		total_forks,
 		nr_running(),
-		nr_iowait());
+		nr_iowait_acct());
 
 	seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a65e19a3ac..1b40847b0a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,7 +953,13 @@ struct task_struct {
 
 	/* Bit to tell TOMOYO we're in execve(): */
 	unsigned			in_execve:1;
+	/* task is in iowait */
 	unsigned			in_iowait:1;
+	/*
+	 * task is in iowait and should be accounted as such. can only be set
+	 * if ->in_iowait is also set.
+	 */
+	unsigned			in_iowait_acct:1;
 #ifndef TIF_RESTORE_SIGMASK
 	unsigned			restore_sigmask:1;
 #endif
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..31e8a44b3d71 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -19,8 +19,9 @@ DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned int nr_running(void);
 extern bool single_task_running(void);
-extern unsigned int nr_iowait(void);
-extern unsigned int nr_iowait_cpu(int cpu);
+unsigned int nr_iowait_acct(void);
+unsigned int nr_iowait_acct_cpu(int cpu);
+unsigned int nr_iowait_cpu(int cpu);
 
 static inline int sched_info_on(void)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6c494e33206..a082919caaf8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3541,19 +3541,33 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 
 #endif /* !CONFIG_SMP */
 
+/* Shift half the size of the type, atomic_long_t */
+#define IOWAIT_SHIFT	(4 * sizeof(atomic_long_t))
+
+/*
+ * Store iowait and iowait_acct state in the same variable. The lower bits
+ * hold the iowait state, and the upper bits hold the iowait_acct state.
+ */
 static void task_iowait_inc(struct task_struct *p)
 {
-	atomic_long_inc(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << IOWAIT_SHIFT);
+	atomic_long_add(val, &task_rq(p)->nr_iowait);
 }
 
 static void task_iowait_dec(struct task_struct *p)
 {
-	atomic_long_dec(&task_rq(p)->nr_iowait);
+	long val = 1 + ((long) p->in_iowait_acct << IOWAIT_SHIFT);
+	atomic_long_sub(val, &task_rq(p)->nr_iowait);
 }
 
 int rq_iowait(struct rq *rq)
 {
-	return atomic_long_read(&rq->nr_iowait);
+	return atomic_long_read(&rq->nr_iowait) & ((1UL << IOWAIT_SHIFT) - 1);
+}
+
+int rq_iowait_acct(struct rq *rq)
+{
+	return atomic_long_read(&rq->nr_iowait) >> IOWAIT_SHIFT;
 }
 
 static void
@@ -5295,7 +5309,12 @@ unsigned long long nr_context_switches(void)
  * it does become runnable.
  */
 
-unsigned int nr_iowait_cpu(int cpu)
+unsigned int nr_iowait_acct_cpu(int cpu)
+{
+	return rq_iowait_acct(cpu_rq(cpu));
+}
+
+unsigned nr_iowait_cpu(int cpu)
 {
 	return rq_iowait(cpu_rq(cpu));
 }
@@ -5330,12 +5349,12 @@ unsigned int nr_iowait_cpu(int cpu)
  * Task CPU affinities can make all that even more 'interesting'.
  */
 
-unsigned int nr_iowait(void)
+unsigned int nr_iowait_acct(void)
 {
 	unsigned int i, sum = 0;
 
 	for_each_possible_cpu(i)
-		sum += nr_iowait_cpu(i);
+		sum += nr_iowait_acct_cpu(i);
 
 	return sum;
 }
@@ -7415,18 +7434,24 @@ static inline void preempt_dynamic_init(void) { }
 
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
+/*
+ * Returns a token which is composed of the two bits of iowait wait state -
+ * one is whether we're marking ourselves as in iowait for cpufreq reasons,
+ * and the other is whether the task should be accounted as such.
+ */
 long io_schedule_prepare(void)
 {
-	long old_iowait = current->in_iowait;
-
+	long token = current->in_iowait + ((long) current->in_iowait_acct << IOWAIT_SHIFT);
 	current->in_iowait = 1;
+	current->in_iowait_acct = 1;
 	blk_flush_plug(current->plug, true);
-	return old_iowait;
+	return token;
 }
 
 void io_schedule_finish(long token)
 {
-	current->in_iowait = token;
+	current->in_iowait = token & 0x01;
+	current->in_iowait_acct = token >> IOWAIT_SHIFT;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index da2e67621f39..be377ecd1301 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3580,6 +3580,7 @@ extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 int rq_iowait(struct rq *rq);
+int rq_iowait_acct(struct rq *rq);
 
 #ifdef CONFIG_RT_MUTEXES
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184c7090..17a82771c3df 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -732,7 +732,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	delta = ktime_sub(now, ts->idle_entrytime);
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	if (nr_iowait_acct_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -805,7 +805,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
-				     !nr_iowait_cpu(cpu), last_update_time);
+				     !nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -831,7 +831,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 
 	return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
-				     nr_iowait_cpu(cpu), last_update_time);
+				     nr_iowait_acct_cpu(cpu), last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] sched/core: split iowait state into two states
  2024-08-19 15:39 ` [PATCH 4/4] sched/core: split iowait state " Jens Axboe
@ 2024-09-05 10:55   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2024-09-05 10:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, tglx

On Mon, Aug 19, 2024 at 09:39:49AM -0600, Jens Axboe wrote:

> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than would otherwise have been picked for iowait'ing tasks. However,
> it's harmful in that lots of applications and monitoring assume that
> iowait is busy time, or otherwise use it as a health metric.
> Particularly for async IO it's entirely nonsensical.

Holdup... I should've remembered this. You're using ->in_iowait for
io_uring. And that is what started all this.

Now, having in_iowait set when you're waiting on a futex is utterly
insane, but looking at commit 7b72d661f1f2 ("io_uring: gate iowait
schedule on having pending requests") you're now only actually setting
in_iowait when you have pending IO.

And I don't think that is nonsensical as you write above.

You are after all still actually waiting while there is pending IO, no?


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-09-05 10:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-16 12:11 [PATCHSET v4 0/4] Split iowait into two states Jens Axboe
2024-04-16 12:11 ` [PATCH 1/4] sched/core: add helpers for iowait handling Jens Axboe
2024-04-16 12:11 ` [PATCH 2/4] sched/core: change rq->nr_iowait type to atomic_long_t on 64-bit Jens Axboe
2024-04-24  9:56   ` Peter Zijlstra
2024-04-16 12:11 ` [PATCH 3/4] sched/core: have io_schedule_prepare() return a long Jens Axboe
2024-04-16 12:11 ` [PATCH 4/4] sched/core: split iowait state into two states Jens Axboe
2024-04-16 14:10   ` Christian Loehle
2024-04-16 14:25     ` Christian Loehle
2024-04-24 10:01   ` Peter Zijlstra
2024-04-24 10:08     ` Christian Loehle
2024-04-25 10:16       ` Peter Zijlstra
2024-04-25 10:39         ` Christian Loehle
2024-04-25 14:20     ` Rafael J. Wysocki
  -- strict thread matches above, loose matches on Subject: below --
2024-08-17 20:45 [PATCHSET v5 0/4] Split iowait " Jens Axboe
2024-08-17 20:45 ` [PATCH 4/4] sched/core: split iowait state " Jens Axboe
2024-08-19 15:39 [PATCHSET v6 0/4] Split iowait " Jens Axboe
2024-08-19 15:39 ` [PATCH 4/4] sched/core: split iowait state " Jens Axboe
2024-09-05 10:55   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox