From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1472225066.1821.24.camel@suse.cz>
Subject: Re: [PATCH] sched/cputime: do not account thread group tasks pending runtime to improve performance
From: Giovanni Gherdovich
To: Stanislaw Gruszka, linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Mel Gorman
Date: Fri, 26 Aug 2016 17:24:26 +0200
In-Reply-To: <20160817093043.GA25206@redhat.com>
References: <20160817093043.GA25206@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2016-08-17 at 11:30 +0200, Stanislaw Gruszka wrote:
> Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") makes we
> account thread group tasks pending runtime in thread_group_cputime().
> Another commit 6e998916dfe32 ("sched/cputime: Fix
> clock_nanosleep()/clock_gettime() inconsistency") makes we update
> scheduler runtime statistics (call update_curr()) when read task
> pending runtime. Those changes cause bad performance of times() and
> clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) syscalls.
>
> While we would like to have cpuclock monotonicity kept i.e. have
> problems fixed by above commits stay fixed, we also would like to have
> good performance.
>
> [... snip ...]
>
> Reported-and-tested-by: Giovanni Gherdovich
> Signed-off-by: Stanislaw Gruszka
> ---
>  kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 1934f65..4fca604 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -301,6 +301,26 @@ static inline cputime_t account_other_time(cputime_t max)
>  	return accounted;
>  }
>
> +#ifdef CONFIG_64BIT
> +static inline u64 read_sum_exec_runtime(struct task_struct *t)
> +{
> +	return t->se.sum_exec_runtime;
> +}
> +#else
> +static u64 read_sum_exec_runtime(struct task_struct *t)
> +{
> +	u64 ns;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(t, &rf);
> +	ns = t->se.sum_exec_runtime;
> +	task_rq_unlock(rq, t, &rf);
> +
> +	return ns;
> +}
> +#endif
> +
>  /*
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
> @@ -313,6 +333,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>
> +	/*
> +	 * Update current task runtime to account pending time since last
> +	 * scheduler action or thread_group_cputime() call. This thread group
> +	 * might have other running tasks on different CPUs, but updating
> +	 * their runtime can affect syscall performance, so we skip account
> +	 * those pending times and rely only on values updated on tick or
> +	 * other scheduler action.
> +	 */
> +	if (same_thread_group(current, tsk))
> +		(void) task_sched_runtime(current);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -327,7 +358,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  		task_cputime(t, &utime, &stime);
>  		times->utime += utime;
>  		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> +		times->sum_exec_runtime += read_sum_exec_runtime(t);
>  	}
>  	/* If lockless access failed, take the lock. */
>  	nextseq = 1;

Hello Stanislaw and all,

I know I'm quite late to the party as this patch is already taken into
Ingo's "tip" repo, but I want to chime in anyway and give my positive
review and acknowledgment of the patch.

The patch works as advertised in the commit message; the time
accounting behaviour you're changing is consistent with what happened
before d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles"), i.e. only
the runtime statistics for the current task are up to date, not those
for all the other threads in the group. As you say, that's how things
used to work -- I'm favorable to this trade-off.

You correctly address Mel Gorman's remark ("how do you know that
tsk == current?") by using the "current" macro when you call
task_sched_runtime(). As you note, task_sched_runtime(current) (which
in turn calls update_curr() on that task) is all you need to solve the
problem of "the diff of 'process' should always be >= the diff of
'thread'" that you initially addressed in your 6e998916dfe32
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency").

Acked-by: Giovanni Gherdovich

--
Giovanni Gherdovich
SUSE Labs