Date: Mon, 29 Nov 2010 22:22:13 +0800
From: Yong Zhang
To: Peter Zijlstra
Cc: mingo@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org, tglx@linutronix.de, venki@google.com, mingo@elte.hu, linux-tip-commits@vger.kernel.org
Subject: Re: [tip:sched/core] sched: Do not account irq time to current task
Message-ID: <20101129142213.GA2573@zhy>
References: <1286237003-12406-7-git-send-email-venki@google.com> <1291031990.32004.24.camel@laptop>
In-Reply-To: <1291031990.32004.24.camel@laptop>
User-Agent: Mutt/1.5.20 (2009-06-14)
List-ID: linux-kernel@vger.kernel.org

On Mon, Nov 29, 2010 at 12:59:50PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-29 at 16:45 +0800, Yong Zhang wrote:
> > On Tue, Oct 19, 2010 at 3:26 AM, tip-bot for Venkatesh Pallipadi wrote:
> > > Commit-ID:  305e6835e05513406fa12820e40e4a8ecb63743c
> > > Gitweb:     http://git.kernel.org/tip/305e6835e05513406fa12820e40e4a8ecb63743c
> > > Author:     Venkatesh Pallipadi
> > > AuthorDate: Mon, 4 Oct 2010 17:03:21 -0700
> > > Committer:  Ingo Molnar
> > > CommitDate: Mon, 18 Oct 2010 20:52:26 +0200
> > >
> > > sched: Do not account irq time to current task
> > >
> > > The scheduler accounts both softirq and interrupt processing time to the
> > > currently running task. This means that if the interrupt processing was
> > > done on behalf of some other task in the system, the current task ends
> > > up being penalized, as it gets a shorter runtime than it otherwise would.
> > >
> > > Change sched task accounting to account only actual task time to the
> > > currently running task. Now update_curr() modifies delta_exec to
> > > depend on rq->clock_task.
> > >
> > > Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case.
> > > We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort,
> > > but that's for later.
> > >
> > > This change will impact scheduling behavior in interrupt-heavy conditions.
> > >
> > > Tested on a 4-way system with eth0 handled by CPU 2 and a network-heavy
> > > task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2 spends
> > > 75%+ of its time in irq processing and CPU 3 spends around 35% of its
> > > time running the nc task.
> > >
> > > Now, if I run another CPU-intensive task on CPU 2, without this change
> > > /proc//schedstat shows 100% of the time accounted to this task. With
> > > this change, it rightly shows less than 25% accounted to this task, as
> > > the remaining time is actually spent on irq processing.
> > >
> > > Signed-off-by: Venkatesh Pallipadi
> > > Signed-off-by: Peter Zijlstra
> > > LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
> > > Signed-off-by: Ingo Molnar
> > > ---
> > >  kernel/sched.c      |   43 ++++++++++++++++++++++++++++++++++++++++---
> > >  kernel/sched_fair.c |    6 +++---
> > >  kernel/sched_rt.c   |    8 ++++----
> > >  3 files changed, 47 insertions(+), 10 deletions(-)
> >
> > [snip]
> >
> > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > > index ab77aa0..bea7d79 100644
> > > --- a/kernel/sched_rt.c
> > > +++ b/kernel/sched_rt.c
> > > @@ -609,7 +609,7 @@ static void update_curr_rt(struct rq *rq)
> > >  	if (!task_has_rt_policy(curr))
> > >  		return;
> > >
> > > -	delta_exec = rq->clock - curr->se.exec_start;
> > > +	delta_exec = rq->clock_task - curr->se.exec_start;
> > >  	if (unlikely((s64)delta_exec < 0))
> > >  		delta_exec = 0;
> > >
> > > @@ -618,7 +618,7 @@ static void update_curr_rt(struct rq *rq)
> > >  	curr->se.sum_exec_runtime += delta_exec;
> > >  	account_group_exec_runtime(curr, delta_exec);
> > >
> > > -	curr->se.exec_start = rq->clock;
> > > +	curr->se.exec_start = rq->clock_task;
> > >  	cpuacct_charge(curr, delta_exec);
> > >
> > >  	sched_rt_avg_update(rq, delta_exec);
> >
> > It seems the above change to update_curr_rt() introduces a false positive
> > in rt_bandwidth control. For example, with:
> >   rt_period  = 1000000;
> >   rt_runtime = 950000;
> > if the irq_time in that period is non-zero (say 50000), then according to
> > the throttle mechanism RT is not throttled, and in the end we leave no
> > time for others. It seems this breaks the semantics of throttling.
> >
> > Maybe we can revert the change to update_curr_rt()?
>
> No, that's totally correct.
>
> It's the correct and desired behaviour; IRQ time is not time spent
> running the RT tasks, hence they should not be charged for it.

Right.

> If you still want to throttle RT tasks, simply ensure their bandwidth
> constraint is lower than the available time.
But the available time is harder to calculate than before: IRQ arrival is
random, and so is irq_time, while the unthrottle path
(do_sched_rt_period_timer()) runs at a fixed period based on the normal
clock. Is that what we want?

Thanks,
Yong