Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Frederic Weisbecker <frederic@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Chris Metcalf <cmetcalf@mellanox.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Christoph Lameter <cl@linux.com>,
	"Paul E . McKenney" <paulmck@linux.vnet.ibm.com>,
	Wanpeng Li <kernellwp@gmail.com>, Mike Galbraith <efault@gmx.de>,
	Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload
Date: Mon, 29 Jan 2018 17:48:33 +0100	[thread overview]
Message-ID: <20180129164832.GC2942@lerouge> (raw)
In-Reply-To: <20180129153839.GT2269@hirez.programming.kicks-ass.net>

On Mon, Jan 29, 2018 at 04:38:39PM +0100, Peter Zijlstra wrote:
1;4205;0c> On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However this residual tick is a burden
> > for bare metal tasks that can't stand any interruption at all, or want
> > to minimize them.
> > 
> > The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> > outsource these scheduler ticks to the global workqueue so that a
> > housekeeping CPU handles those remotely.
> > 
> > Note that in the case of using isolcpus, it's still up to the user to
> > affine the global workqueues to the housekeeping CPUs through
> > /sys/devices/virtual/workqueue/cpumask or domains isolation
> > "isolcpus=nohz,domain".
> 
> I would very much like a few words on why sched_class::task_tick() is
> safe to call remote -- from a quick look I think it actually is, but it
> would be good to have some words here.

Let's rather say I can't prove that it is safe, given the amount of code that
is behind throughout the various flavour of scheduler features.

But as far as I checked several times, it seems that nothing is accessed locally
on ::scheduler_tick(). Everything looks fetched from the runqueue struct target
while it is locked.

If we ever find local references such as "current" or "__this_cpu_*" in the path,
we'll have to fix them.

> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d72d0e9..c79500c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
> >  
> >  	return jiffies_to_nsecs(next - now);
> >  }
> > -#endif
> > +
> > +struct tick_work {
> > +	int			cpu;
> > +	struct delayed_work	work;
> > +};
> > +
> > +static struct tick_work __percpu *tick_work_cpu;
> > +
> > +static void sched_tick_remote(struct work_struct *work)
> > +{
> > +	struct delayed_work *dwork = to_delayed_work(work);
> > +	struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > +	int cpu = twork->cpu;
> > +	struct rq *rq = cpu_rq(cpu);
> > +	struct rq_flags rf;
> > +
> > +	/*
> > +	 * Handle the tick only if it appears the remote CPU is running
> > +	 * in full dynticks mode. The check is racy by nature, but
> > +	 * missing a tick or having one too much is no big deal.
> > +	 */
> > +	if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > +		rq_lock_irq(rq, &rf);
> > +		update_rq_clock(rq);
> > +		rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > +		rq_unlock_irq(rq, &rf);
> > +	}
> > +
> > +	queue_delayed_work(system_unbound_wq, dwork, HZ);
> 
> Do we want something that tracks the actual interrer arrival time of
> this work, such that we can detect and warn if the book-keeping thing is
> failing to keep up?

Yeah perhaps we can have some sort of check to make sure we got a tick after
some reasonable delay since the last sched in of the current remote task.

> 
> > +}
> > +
> > +static void sched_tick_start(int cpu)
> > +{
> > +	struct tick_work *twork;
> > +
> > +	if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > +		return;
> 
> This all looks very static :-(, you can't reconfigure this nohz_full
> crud after boot?

Unfortunately yes. In fact making the nohz interface dynamically available
through cpuset is the next big step.

> 
> > +	WARN_ON_ONCE(!tick_work_cpu);
> > +
> > +	twork = per_cpu_ptr(tick_work_cpu, cpu);
> > +	twork->cpu = cpu;
> > +	INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > +	queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +}
> 
> Similarly, I think we want a few words about how unbound workqueues are
> expected to behave vs NUMA.
> 
> AFAICT unbound workqueues by default prefer to run on a cpu in the same
> node, but if no cpu is available, it doesn't go looking for the nearest
> node that does have a cpu, it just punts to whatever random cpu.

Yes, and in fact you just made me look into wq_select_unbound_cpu() and
it looks worse than that. If the current CPU is not in the wq_unbound_cpumask,
a random one is picked up from that global cpumask without trying a near
one in the current node.

Looks like room for improvement on the workqueue side. I'll see what I can do.

Thanks.

next prev parent reply	other threads:[~2018-01-29 16:48 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-19  0:02 [GIT PULL] isolation: 1Hz residual tick offloading v4 Frederic Weisbecker
2018-01-19  0:02 ` [PATCH 1/6] sched: Rename init_rq_hrtick to hrtick_rq_init Frederic Weisbecker
2018-01-19  0:02 ` [PATCH 2/6] nohz: Allow to check if remote CPU tick is stopped Frederic Weisbecker
2018-01-19  0:02 ` [PATCH 3/6] sched/isolation: Isolate workqueues when "nohz_full=" is set Frederic Weisbecker
2018-01-19  0:02 ` [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
2018-01-29 15:38   ` Peter Zijlstra
2018-01-29 16:48     ` Frederic Weisbecker [this message]
2018-01-29 17:20       ` Peter Zijlstra
2018-01-29 15:39   ` Peter Zijlstra
2018-01-19  0:02 ` [PATCH 5/6] sched/nohz: Remove the 1 Hz tick code Frederic Weisbecker
2018-01-19  0:02 ` [PATCH 6/6] sched/isolation: Tick offload documentation Frederic Weisbecker
2018-01-24 15:46 ` [GIT PULL] isolation: 1Hz residual tick offloading v4 Luiz Capitulino
2018-01-29  1:10   ` Frederic Weisbecker
2018-01-29 15:33     ` Luiz Capitulino
2018-01-29 15:54       ` Peter Zijlstra
2018-01-29 16:27         ` Luiz Capitulino
2018-01-29 15:47     ` Peter Zijlstra
2018-05-22 19:10     ` Yauheni Kaliuta
2018-05-25  2:56       ` Frederic Weisbecker
2018-05-25 12:51         ` Luiz Capitulino
2018-01-29  1:18 ` (Ping?) " Frederic Weisbecker
  -- strict thread matches above, loose matches on Subject: below --
2018-02-08 17:59 [PATCH 0/6] isolation: 1Hz residual tick offloading v5 Frederic Weisbecker
2018-02-08 17:59 ` [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload Frederic Weisbecker
2018-02-09  7:16   ` Ingo Molnar
2018-02-10 10:29     ` Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180129164832.GC2942@lerouge \
    --to=frederic@kernel.org \
    --cc=cl@linux.com \
    --cc=cmetcalf@mellanox.com \
    --cc=efault@gmx.de \
    --cc=kernellwp@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.