Date: Tue, 7 Dec 2010 18:43:33 +0530
From: Bharata B Rao
Reply-To: bharata@linux.vnet.ibm.com
To: Balbir Singh
Cc: linux-kernel@vger.kernel.org, Dhaval Giani, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
	Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
Subject: Re: [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
Message-ID: <20101207131333.GA10723@in.ibm.com>
References: <20101012074910.GA9893@in.ibm.com> <20101012075247.GE9893@in.ibm.com>
	<20101015044552.GI13048@balbir.in.ibm.com>
In-Reply-To: <20101015044552.GI13048@balbir.in.ibm.com>

Sorry Balbir, I didn't realize that we hadn't replied to these comments
of yours.

On Fri, Oct 15, 2010 at 10:15:52AM +0530, Balbir Singh wrote:
> * Bharata B Rao [2010-10-12 13:22:47]:
> 
> > sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
> > 
> > From: Paul Turner
> > 
> > At the start of a new period there are several actions we must take:
> > - Refresh global bandwidth pool
> > - Unthrottle entities who ran out of quota as refreshed bandwidth permits
> > 
> > Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
> > into the cfs entity hierarchy.
> > 
> 
> Am I reading this right?

Yes, that changelog sentence is wrong (it is the throttled entities that
have the flag set; unthrottled entities have it cleared) and needs to be
corrected. Thanks.

> 
> > sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
> > since it is now shared by both cfs and rt bandwidth period timers.
> > 
> > The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> > rd->span instead of cpu_online_mask since I think that was incorrect
> > before (we don't want to hit cpus outside of our root_domain for RT
> > bandwidth).
> > 
> > Signed-off-by: Paul Turner
> > Signed-off-by: Nikhil Rao
> > Signed-off-by: Bharata B Rao
> > ---
> >  kernel/sched.c      |   16 ++++++++++++
> >  kernel/sched_fair.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched_rt.c   |   19 --------------
> >  3 files changed, 84 insertions(+), 19 deletions(-)
> > 
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -1565,6 +1565,8 @@ static int tg_nop(struct task_group *tg,
> >  }
> >  #endif
> > 
> > +static inline const struct cpumask *sched_bw_period_mask(void);
> > +
> >  #ifdef CONFIG_SMP
> >  /* Used instead of source_load when we know the type == 0 */
> >  static unsigned long weighted_cpuload(const int cpu)
> > @@ -1933,6 +1935,18 @@ static inline void __set_task_cpu(struct
> > 
> >  static const struct sched_class rt_sched_class;
> > 
> > +#ifdef CONFIG_SMP
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > +	return cpu_rq(smp_processor_id())->rd->span;
> > +}
> > +#else
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > +	return cpu_online_mask;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >  /*
> >   * default period for cfs group bandwidth.
> > @@ -8937,6 +8951,8 @@ static int tg_set_cfs_bandwidth(struct t
> > 
> >  		raw_spin_lock_irq(&rq->lock);
> >  		init_cfs_rq_quota(cfs_rq);
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			unthrottle_cfs_rq(cfs_rq);
> >  		raw_spin_unlock_irq(&rq->lock);
> >  	}
> >  	mutex_unlock(&mutex);
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -268,6 +268,13 @@ find_matching_se(struct sched_entity **s
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> > 
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > +static inline
> > +struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
> > +{
> 
> Nitpick, but I'd call this function cfs_bandwidth_cfs_cpu_rq.
> 
> > +	return container_of(cfs_b, struct task_group,
> > +			cfs_bandwidth)->cfs_rq[cpu];
> > +}
> > +
> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >  {
> >  	return &tg->cfs_bandwidth;
> > @@ -1219,6 +1226,29 @@ out_throttled:
> >  	cfs_rq->throttled = 1;
> >  }
> > 
> > +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > +	struct sched_entity *se;
> > +	struct rq *rq = rq_of(cfs_rq);
> > +
> > +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > +	cfs_rq->throttled = 0;
> > +	for_each_sched_entity(se) {
> > +		if (se->on_rq)
> > +			break;
> > +
> > +		cfs_rq = cfs_rq_of(se);
> > +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> 
> Should we really enqueue with ENQUEUE_WAKEUP? The task was throttled,
> not asleep.

Yes, but the two transitions are quite similar: in both cases the
entities go off the runqueue and get enqueued back later. I see two
(side) effects of using DEQUEUE_SLEEP during dequeue and ENQUEUE_WAKEUP
during enqueue:

- vruntime normalization isn't done for throttled entities. This should
  be fine since they don't get pulled around while they are throttled.
- The vruntime of throttled entities is re-calculated during
  unthrottling (enqueue). This ensures that throttled entities don't get
  an undue vruntime advantage when they are enqueued back.

This is my understanding; I would request Paul to comment here.
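To make the second point concrete, the wakeup placement does roughly the
following. This is my simplified paraphrase of mainline place_entity(),
not code from this patch series, and the halved-latency credit assumes
the GENTLE_FAIR_SLEEPERS feature is enabled:

static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
		int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/* A waking entity gets at most half a latency period of credit. */
	if (!initial)
		vruntime -= sysctl_sched_latency >> 1;

	/*
	 * Never move vruntime backwards: an entity that sat throttled for
	 * many periods is clamped near min_vruntime instead of returning
	 * with a huge deficit and then monopolizing the cpu.
	 */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}

So with ENQUEUE_WAKEUP, an entity that was throttled for a long time
resumes close to min_vruntime rather than far behind it.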
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			break;
> > +	}
> > +
> > +	/* determine whether we need to wake up a potentially idle cpu */
> > +	if (rq->curr == rq->idle && rq->cfs.nr_running)
> > +		resched_task(rq->curr);
> > +}
> > +
> >  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >  		unsigned long delta_exec)
> >  {
> > @@ -1241,8 +1271,44 @@ static void account_cfs_rq_quota(struct
> > 
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> >  {
> > -	return 1;
> > +	int i, idle = 1;
> > +	u64 delta;
> > +	const struct cpumask *span;
> > +
> > +	if (cfs_b->quota == RUNTIME_INF)
> > +		return 1;
> 
> I am afraid I don't understand how the return codes are being used
> here. idle is set to 1 if there are no running tasks across all CPUs.
> Why do we return 1 from here?

Remember that we are in the hrtimer handler here; returning 1 ensures
that the hrtimer isn't restarted. So we don't keep the period timer
running for a group that isn't bandwidth constrained at all.
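For reference, the callback side follows the same pattern as the
existing rt period timer; roughly like this (my sketch, not the exact
patch code; field names such as cfs_b->period_timer and cfs_b->period
are assumptions):

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
	struct cfs_bandwidth *cfs_b =
		container_of(timer, struct cfs_bandwidth, period_timer);
	int overrun, idle = 0;

	for (;;) {
		/* Advance the timer over all fully elapsed periods. */
		overrun = hrtimer_forward_now(timer, cfs_b->period);
		if (!overrun)
			break;

		idle = do_sched_cfs_period_timer(cfs_b, overrun);
	}

	/* idle == 1: nothing left to refresh, let the timer die. */
	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

Regards,
Bharata.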