[PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
	mingo@redhat.com, axboe@kernel.dk, tytso@mit.edu, jack@suse.com,
	adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, mingbo@fb.com
Subject: [PATCH v3 1/4] sched: move IO scheduling accounting from io_schedule_timeout() into scheduler
Date: Wed, 7 Dec 2016 15:48:41 -0500	[thread overview]
Message-ID: <20161207204841.GA22296@htj.duckdns.org> (raw)
In-Reply-To: <20161207093510.GE3107@twins.programming.kicks-ass.net>

Hello,

Yeah, that works.  Here's v3 based on your patch.  The other patches
still apply correctly.

Thanks.

------ 8< ------
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside the
scheduler.

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the prepartaion step while #3 must be
done close to the actual scheduling.  This patch moves #3 into the
scheduler so that later patches can separate out preparation and
finish steps from io_schedule().

v3: Replaced with PeterZ's implementation which performs nr_iowait
    accounting in the sleep and wake up path to avoid unnecessarily
    burdening non sleeping paths in __schedule().

v2: Remember the rq in @prev_rq and use it for decrementing nr_iowait
    to avoid misattributing the count after the task gets migrated to
    another CPU.  Noticed by Pavan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Patch-originally-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
---
 kernel/sched/core.c |   68 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 7 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2085,11 +2085,24 @@ try_to_wake_up(struct task_struct *p, un
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
 	}
+
+#else /* CONFIG_SMP */
+
+	if (p->in_iowait) {
+		delayacct_blkio_end();
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
@@ -2139,8 +2152,13 @@ static void try_to_wake_up_local(struct
 
 	trace_sched_waking(p);
 
-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		if (p->in_iowait) {
+			delayacct_blkio_end();
+			atomic_dec(&rq->nr_iowait);
+		}
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}
 
 	ttwu_do_wakeup(rq, p, 0, cookie);
 	ttwu_stat(p, smp_processor_id(), 0);
@@ -2948,6 +2966,36 @@ unsigned long long nr_context_switches(v
 	return sum;
 }
 
+/*
+ * IO-wait accounting, and how its mostly bollocks (on SMP).
+ *
+ * The idea behind IO-wait account is to account the idle time that we could
+ * have spend running if it were not for IO. That is, if we were to improve the
+ * storage performance, we'd have a proportional reduction in IO-wait time.
+ *
+ * This all works nicely on UP, where, when a task blocks on IO, we account
+ * idle time as IO-wait, because if the storage were faster, it could've been
+ * running and we'd not be idle.
+ *
+ * This has been extended to SMP, by doing the same for each CPU. This however
+ * is broken.
+ *
+ * Imagine for instance the case where two tasks block on one CPU, only the one
+ * CPU will have IO-wait accounted, while the other has regular idle. Even
+ * though, if the storage were faster, both could've ran at the same time,
+ * utilising both CPUs.
+ *
+ * This means, that when looking globally, the current IO-wait accounting on
+ * SMP is a lower bound, by reason of under accounting.
+ *
+ * Worse, since the numbers are provided per CPU, they are sometimes
+ * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
+ * associated with any one particular CPU, it can wake to another CPU than it
+ * blocked on. This means the per CPU IO-wait number is meaningless.
+ *
+ * Task CPU affinities can make all that even more 'interesting'.
+ */
+
 unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
@@ -2958,6 +3006,13 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+/*
+ * Consumers of these two interfaces, like for example the cpufreq menu
+ * governor are using nonsensical data. Boosting frequency for a CPU that has
+ * IO-wait which might not even end up running the task when it does become
+ * runnable.
+ */
+
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
@@ -3369,6 +3424,11 @@ static void __sched notrace __schedule(b
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
+			if (prev->in_iowait) {
+				atomic_inc(&rq->nr_iowait);
+				delayacct_blkio_start();
+			}
+
 			/*
 			 * If a worker went to sleep, notify and ask workqueue
 			 * whether it wants to wake up a task to maintain
@@ -5063,19 +5123,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }

next prev parent reply	other threads:[~2016-12-07 20:48 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-28 16:58 [PATCHSET RFC] sched, jbd2: mark sleeps on journal->j_checkpoint_mutex as iowait Tejun Heo
2016-10-28 16:58 ` [PATCH 1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule() Tejun Heo
2016-10-28 18:27   ` Peter Zijlstra
2016-10-28 19:07     ` Peter Zijlstra
2016-10-28 19:12       ` Tejun Heo
2016-10-29  3:21         ` Peter Zijlstra
2016-10-31 16:45           ` Tejun Heo
2016-12-06 21:30             ` Tejun Heo
2016-11-03 15:33   ` Pavan Kondeti
2016-11-08 22:51     ` Tejun Heo
2016-12-06 21:29   ` [PATCH v2 " Tejun Heo
2016-12-07  9:35     ` Peter Zijlstra
2016-12-07 20:48       ` Tejun Heo [this message]
2017-01-14 12:49         ` [tip:sched/core] sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler tip-bot for Tejun Heo
2016-10-28 16:58 ` [PATCH 2/4] sched: separate out io_schedule_prepare() and io_schedule_finish() Tejun Heo
2017-01-14 12:49   ` [tip:sched/core] sched/core: Separate " tip-bot for Tejun Heo
2016-10-28 16:58 ` [PATCH 3/4] mutex: add mutex_lock_io() Tejun Heo
2017-01-14 12:50   ` [tip:sched/core] locking/mutex, sched/wait: Add mutex_lock_io() tip-bot for Tejun Heo
2017-01-14 14:13     ` Mike Galbraith
2017-01-14 16:10       ` Ingo Molnar
2016-10-28 16:58 ` [PATCH 4/4] jbd2: use mutex_lock_io() for journal->j_checkpoint_mutex Tejun Heo
2017-01-14 12:51   ` [tip:sched/core] fs/jbd2, locking/mutex, sched/wait: Use " tip-bot for Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161207204841.GA22296@htj.duckdns.org \
    --to=tj@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=jack@suse.com \
    --cc=kernel-team@fb.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingbo@fb.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.