public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Libo Chen <libo.chen@oracle.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>,
	peterz@infradead.org, vincent.guittot@linaro.org,
	21cnbao@gmail.com, dietmar.eggemann@arm.com,
	linux-kernel@vger.kernel.org, tglx@linutronix.de
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
Date: Thu, 14 Jul 2022 15:18:34 +0100	[thread overview]
Message-ID: <20220714141834.GC3493@suse.de> (raw)
In-Reply-To: <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>

On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
> 
> 
> On 7/13/22 09:40, Tim Chen wrote:
> > On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> > > Barry Song first pointed out that replacing sync wakeup with regular wakeup
> > > seems to reduce overeager wakeup pulling and shows noticeable performance
> > > improvement.[1]
> > > 
> > > This patch argues that allowing sync for wakeups from interrupt context
> > > is a bug and fixing it can improve performance even when irq/softirq is
> > > evenly spread out.
> > > 
> > > For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> > > waker can be any random task that happens to be running on that CPU when the
> > > interrupt comes in. This is completely different from other wakups where the
> > > task running on the waking CPU is the actual waker. For example, two tasks
> > > communicate through a pipe or mutiple tasks access the same critical section,
> > > etc. This difference is important because with sync we assume the waker will
> > > get off the runqueue and go to sleep immedately after the wakeup. The
> > > assumption is built into wake_affine() where it discounts the waker's presence
> > > from the runqueue when sync is true. The random waker from interrupts bears no
> > > relation to the wakee and don't usually go to sleep immediately afterwards
> > > unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> > > preepmtion of waker for sync wakeup as it used to before
> > > patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> > > wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> > > appropriate too but at least sync wakeup will do what it's supposed to do.
> > Will there be scenarios where you do want the task being woken up be pulled
> > to the CPU where the interrupt happened, as the data that needs to be accessed is
> > on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
> > packets received.  Sync still seems desirable, at least if there is no task currently
> > running on the CPU where interrupt happened.  So maybe we should have some consideration
> > of the load on the CPU/NUMA before deciding whether we should do sync wake for such
> > interrupt.
> > 
> There are only two places where sync wakeup matters: wake_affine_idle() and
> wake_affine_weight().
> In wake_affine_idle(), it considers pulling if there is one runnable on the
> waking CPU because
> of the belief that this runnable will voluntarily get off the runqueue. In
> wake_affine_weight(),
> it basically takes off the waker's load again assuming the waker goes to
> sleep after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups from the
> interrupt contexts
> when the waking CPU is non-idle. Wakeups from task context? sure, it seems
> to be a reasonable
> assumption. For your idle case, I totally agree but I don't think having
> sync or not will actually
> have any impacts here giving what the code does. Real impact comes fromMel's
> patch 7332dec055f2457c3
> which makes it less likely to pull tasks when the waking CPU is idle. I
> believe we should consider
> reverting 7332dec055f2 because a significant RDS latency regression has been
> spotted recently on our
> system due to this patch.
> 

The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
It still allowed cache-local migrations on the assumption that the incoming
data was critical enough to justify losing any other cache-hot data. You
state explicitly that "the interrupt CPU isn't as performance critical as
cache from its previous CPU" so that assumption was incorrect, at least
in your case. I don't have a counter example where the interrupt data *is*
more important than any other cache-hot data so the check can go.

I think a revert would not achieve what you want as a plain revert would
still allow an interrupt to pull a task from an arbitrary location as sync
is not checked. A follow-up to your patch or an updated version should not
check available_idle_cpu at all in wake_affine_idle as it's only idle the
wake is from interrupt context and vcpu_is_preempted is not necessarily
justification for pulling a task due to an interrupt.

Something like this but needs testing with your target loads, particularly
the RDS (Relational Database Service?) latency regression;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7b275672c89..e55a3a67a442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
  * soonest. For the purpose of speed we only consider the waking and previous
  * CPU.
  *
- * wake_affine_idle() - only considers 'now', it check if the waking CPU is
- *			cache-affine and is (or	will be) idle.
+ * wake_affine_idle() - only considers 'now', it checks if the waker task is a
+ *			sync wakeup from a CPU that should be idle soon.
  *
  * wake_affine_weight() - considers the weight to reflect the average
  *			  scheduling latency of the CPUs. This seems to work
@@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
 static int
 wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 {
-	/*
-	 * If this_cpu is idle, it implies the wakeup is from interrupt
-	 * context. Only allow the move if cache is shared. Otherwise an
-	 * interrupt intensive workload could force all tasks onto one
-	 * node depending on the IO topology or IRQ affinity settings.
-	 *
-	 * If the prev_cpu is idle and cache affine then avoid a migration.
-	 * There is no guarantee that the cache hot data from an interrupt
-	 * is more important than cache hot data on the prev_cpu and from
-	 * a cpufreq perspective, it's better to have higher utilisation
-	 * on one CPU.
-	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 

  parent reply	other threads:[~2022-07-14 14:18 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
2022-07-13 16:40 ` Tim Chen
     [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
2022-07-13 19:34     ` Libo Chen
2022-07-13 20:51     ` Tim Chen
2022-07-13 21:37       ` Libo Chen
2022-07-14 14:18     ` Mel Gorman [this message]
2022-07-14 20:21       ` Libo Chen
2022-07-15 10:07         ` Mel Gorman
2022-07-14 11:26 ` Peter Zijlstra
2022-07-14 11:35 ` Peter Zijlstra
2022-07-14 18:18   ` Libo Chen
2022-07-21  8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
2022-08-01 13:26   ` Ingo Molnar
2022-08-01 14:59     ` Libo Chen
2022-08-03  9:18       ` Ingo Molnar
2022-08-03 19:37         ` Libo Chen
2022-08-04  8:55           ` Ingo Molnar
2022-08-04  9:51             ` Mel Gorman
2022-08-01 14:57   ` Libo Chen
2022-08-02  4:40     ` K Prateek Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220714141834.GC3493@suse.de \
    --to=mgorman@suse.de \
    --cc=21cnbao@gmail.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=libo.chen@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox