From: Mel Gorman <mgorman@suse.de>
To: Libo Chen <libo.chen@oracle.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>,
peterz@infradead.org, vincent.guittot@linaro.org,
21cnbao@gmail.com, dietmar.eggemann@arm.com,
linux-kernel@vger.kernel.org, tglx@linutronix.de
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
Date: Thu, 14 Jul 2022 15:18:34 +0100 [thread overview]
Message-ID: <20220714141834.GC3493@suse.de> (raw)
In-Reply-To: <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
>
>
> On 7/13/22 09:40, Tim Chen wrote:
> > On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
> > > Barry Song first pointed out that replacing sync wakeup with regular wakeup
> > > seems to reduce overeager wakeup pulling and shows noticeable performance
> > > improvement.[1]
> > >
> > > This patch argues that allowing sync for wakeups from interrupt context
> > > is a bug and fixing it can improve performance even when irq/softirq is
> > > evenly spread out.
> > >
> > > For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
> > > waker can be any random task that happens to be running on that CPU when the
> > > interrupt comes in. This is completely different from other wakups where the
> > > task running on the waking CPU is the actual waker. For example, two tasks
> > > communicate through a pipe or mutiple tasks access the same critical section,
> > > etc. This difference is important because with sync we assume the waker will
> > > get off the runqueue and go to sleep immedately after the wakeup. The
> > > assumption is built into wake_affine() where it discounts the waker's presence
> > > from the runqueue when sync is true. The random waker from interrupts bears no
> > > relation to the wakee and don't usually go to sleep immediately afterwards
> > > unless wakeup granularity is reached. Plus the scheduler no longer enforces the
> > > preepmtion of waker for sync wakeup as it used to before
> > > patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
> > > wakeup preemption for wakeups from interrupt contexts doesn't seem to be
> > > appropriate too but at least sync wakeup will do what it's supposed to do.
> > Will there be scenarios where you do want the task being woken up be pulled
> > to the CPU where the interrupt happened, as the data that needs to be accessed is
> > on local CPU/NUMA that interrupt happened? For example, interrupt associated with network
> > packets received. Sync still seems desirable, at least if there is no task currently
> > running on the CPU where interrupt happened. So maybe we should have some consideration
> > of the load on the CPU/NUMA before deciding whether we should do sync wake for such
> > interrupt.
> >
> There are only two places where sync wakeup matters: wake_affine_idle() and
> wake_affine_weight().
> In wake_affine_idle(), it considers pulling if there is one runnable on the
> waking CPU because
> of the belief that this runnable will voluntarily get off the runqueue. In
> wake_affine_weight(),
> it basically takes off the waker's load again assuming the waker goes to
> sleep after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups from the
> interrupt contexts
> when the waking CPU is non-idle. Wakeups from task context? sure, it seems
> to be a reasonable
> assumption. For your idle case, I totally agree but I don't think having
> sync or not will actually
> have any impacts here giving what the code does. Real impact comes fromMel's
> patch 7332dec055f2457c3
> which makes it less likely to pull tasks when the waking CPU is idle. I
> believe we should consider
> reverting 7332dec055f2 because a significant RDS latency regression has been
> spotted recently on our
> system due to this patch.
>
The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
It still allowed cache-local migrations on the assumption that the incoming
data was critical enough to justify losing any other cache-hot data. You
state explicitly that "the interrupt CPU isn't as performance critical as
cache from its previous CPU" so that assumption was incorrect, at least
in your case. I don't have a counter example where the interrupt data *is*
more important than any other cache-hot data so the check can go.
I think a revert would not achieve what you want as a plain revert would
still allow an interrupt to pull a task from an arbitrary location as sync
is not checked. A follow-up to your patch or an updated version should not
check available_idle_cpu at all in wake_affine_idle as it's only idle the
wake is from interrupt context and vcpu_is_preempted is not necessarily
justification for pulling a task due to an interrupt.
Something like this but needs testing with your target loads, particularly
the RDS (Relational Database Service?) latency regression;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7b275672c89..e55a3a67a442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
* soonest. For the purpose of speed we only consider the waking and previous
* CPU.
*
- * wake_affine_idle() - only considers 'now', it check if the waking CPU is
- * cache-affine and is (or will be) idle.
+ * wake_affine_idle() - only considers 'now', it checks if the waker task is a
+ * sync wakeup from a CPU that should be idle soon.
*
* wake_affine_weight() - considers the weight to reflect the average
* scheduling latency of the CPUs. This seems to work
@@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
- /*
- * If this_cpu is idle, it implies the wakeup is from interrupt
- * context. Only allow the move if cache is shared. Otherwise an
- * interrupt intensive workload could force all tasks onto one
- * node depending on the IO topology or IRQ affinity settings.
- *
- * If the prev_cpu is idle and cache affine then avoid a migration.
- * There is no guarantee that the cache hot data from an interrupt
- * is more important than cache hot data on the prev_cpu and from
- * a cpufreq perspective, it's better to have higher utilisation
- * on one CPU.
- */
- if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
- return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-
if (sync && cpu_rq(this_cpu)->nr_running == 1)
return this_cpu;
next prev parent reply other threads:[~2022-07-14 14:18 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
2022-07-13 16:40 ` Tim Chen
[not found] ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
2022-07-13 19:34 ` Libo Chen
2022-07-13 20:51 ` Tim Chen
2022-07-13 21:37 ` Libo Chen
2022-07-14 14:18 ` Mel Gorman [this message]
2022-07-14 20:21 ` Libo Chen
2022-07-15 10:07 ` Mel Gorman
2022-07-14 11:26 ` Peter Zijlstra
2022-07-14 11:35 ` Peter Zijlstra
2022-07-14 18:18 ` Libo Chen
2022-07-21 8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
2022-07-29 4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
2022-08-01 13:26 ` Ingo Molnar
2022-08-01 14:59 ` Libo Chen
2022-08-03 9:18 ` Ingo Molnar
2022-08-03 19:37 ` Libo Chen
2022-08-04 8:55 ` Ingo Molnar
2022-08-04 9:51 ` Mel Gorman
2022-08-01 14:57 ` Libo Chen
2022-08-02 4:40 ` K Prateek Nayak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220714141834.GC3493@suse.de \
--to=mgorman@suse.de \
--cc=21cnbao@gmail.com \
--cc=dietmar.eggemann@arm.com \
--cc=libo.chen@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=tim.c.chen@linux.intel.com \
--cc=vincent.guittot@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.