From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S935319AbcHJSb4 (ORCPT ); Wed, 10 Aug 2016 14:31:56 -0400
Received: from mail-wm0-f42.google.com ([74.125.82.42]:36520 "EHLO
	mail-wm0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753398AbcHJSbv (ORCPT ); Wed, 10 Aug 2016 14:31:51 -0400
Date: Wed, 10 Aug 2016 13:26:41 +0200
From: Ingo Molnar
To: Giovanni Gherdovich
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
	linux-kernel@vger.kernel.org, Mel Gorman
Subject: Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in
	times()/clock_gettime()
Message-ID: <20160810112641.GA30126@gmail.com>
References: <1470385316-15027-1-git-send-email-ggherdovich@suse.cz>
	<1470385316-15027-2-git-send-email-ggherdovich@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1470385316-15027-2-git-send-email-ggherdovich@suse.cz>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Giovanni Gherdovich wrote:

> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling class's update_curr() when the cputimer starts.
>
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
>
> This patch mitigates the performance loss by prefetching data in the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
>
> Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement varies between a few percent for 5-20
> threads and more than 10% for 2 or >20 threads.
>
> pound_clock_gettime:
>
>     threads       4.7-rc7     patched 4.7-rc7
>     [num]         [secs]      [secs (percent)]
>       2           3.48        3.06 ( 11.83%)
>       5           3.33        3.25 (  2.40%)
>       8           3.37        3.26 (  3.30%)
>      12           3.32        3.37 ( -1.60%)
>      21           4.01        3.90 (  2.74%)
>      30           3.63        3.36 (  7.41%)
>      48           3.71        3.11 ( 16.27%)
>      79           3.75        3.16 ( 15.74%)
>     110           3.81        3.25 ( 14.80%)
>     128           3.88        3.31 ( 14.76%)

Nice detective work! I'm wondering, where do we stand if compared with a
pre-6e998916dfe3 kernel?

I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and
I suspect v3.17 does not run easily on a recent distro. Could you attempt to
revert the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e.
don't try to make the result correct, just see what the performance gap is,
roughly.

If there's still a significant gap then it might make sense to optimize this
some more.

Thanks,

	Ingo
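
For reference, a userspace measurement of the kind discussed above could look roughly like the sketch below. It is not the actual pound_clock_gettime benchmark quoted in the table; the names `pound_arg`, `pound_thread` and `percent_gain`, and this particular `pound_clock_gettime()` signature, are assumptions made for illustration. Each thread calls clock_gettime(CLOCK_PROCESS_CPUTIME_ID) in a loop, the wall-clock time for all threads to finish is returned, and the gain is computed as (before-after)/before.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-thread argument: number of calls each thread makes. */
struct pound_arg {
	long iters;
};

/* Thread body: hammer the process CPU-time clock `iters` times. */
void *pound_thread(void *p)
{
	const struct pound_arg *arg = p;
	struct timespec ts;

	for (long i = 0; i < arg->iters; i++)
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return NULL;
}

/*
 * Run `nthreads` threads, each making `iters` calls.
 * Returns elapsed wall-clock seconds, or a negative value on error.
 */
double pound_clock_gettime(int nthreads, long iters)
{
	struct pound_arg arg = { .iters = iters };
	struct timespec start, end;
	pthread_t *tids;
	int started = 0;

	if (nthreads <= 0)
		return -1.0;
	tids = calloc((size_t)nthreads, sizeof(*tids));
	if (!tids)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, pound_thread, &arg) != 0)
			break;
		started++;
	}
	/* Join only the threads that were actually created. */
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);
	free(tids);

	if (started != nthreads)
		return -1.0;
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Percentage gain as defined in the quoted text: positive means faster. */
double percent_gain(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Running pound_clock_gettime() with the same thread counts on each kernel under comparison and feeding the averaged times to percent_gain() would give numbers in the same form as the table. The iteration count and number of runs are left to the caller.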