From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755901AbaAVQZ2 (ORCPT ); Wed, 22 Jan 2014 11:25:28 -0500 Received: from g1t0029.austin.hp.com ([15.216.28.36]:45227 "EHLO g1t0029.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751379AbaAVQZ1 (ORCPT ); Wed, 22 Jan 2014 11:25:27 -0500 Message-ID: <52DFF0DC.3050303@hp.com> Date: Wed, 22 Jan 2014 11:25:00 -0500 From: Waiman Long User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12 MIME-Version: 1.0 To: Peter Zijlstra CC: Ingo Molnar , linux-kernel@vger.kernel.org, "Paul E. McKenney" , Frederic Weisbecker , "Eric W. Biederman" , Andrew Morton , Serge Hallyn , Aswin Chandramouleeswaran , Scott J Norton , Paul Turner , Ben Segall Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg References: <1389838956-56574-1-git-send-email-Waiman.Long@hp.com> <20140116124457.GR31570@twins.programming.kicks-ass.net> In-Reply-To: <20140116124457.GR31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/16/2014 07:44 AM, Peter Zijlstra wrote: > First of.. WTF is v1? > > Secondly, please always CC the authors of the code you're changing. The v1 patch was sent quite a while ago on 9/21/2013. See https://lkml.org/lkml/2013/9/23/551 There was no feedback at that time. As this was not a high priority patch for me, I didn't follow up at that time. I will include the change log in the next version. > On Wed, Jan 15, 2014 at 09:22:36PM -0500, Waiman Long wrote: >> It was found that with a perf profile of a compute workload (at 1500 >> users) of the AIM7 benchmark running on a glueless 4-socket 40-core >> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling >> tick related functions account for quite a significant portion of >> the total kernel cpu cycles. >> >> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load >> 0.47% reaim [kernel.kallsyms] [k] entity_tick >> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares >> 0.03% reaim [kernel.kallsyms] [k] update_curr >> >> The scheduling tick functions account for about 1.22% of the total >> CPU cycles. Of the top 2 function in the above list, the reading >> and writing of the tg->load_avg variable account for over 90% of the >> CPU cycles: >> >> atomic_long_add(tg_contrib,&tg->load_avg); >> atomic_long_read(&tg->load_avg) + 1); >> >> This patch reduces the contention on the load_avg variable (and >> secondarily on the runnable_avg variable) by the following 2 measures: >> >> 1. Make the load_avg and runnable_avg fields of the task_group >> structure sit in their own cacheline without sharing it with others. >> This only applies if the kernel is built for NUMA systems with >> multiple sockets. > So why not for SMP? The cache coherency traffic is generally not a problem for single-socket multi-core system, that is why I currently increase the data structure size only for kernel built for multi-socket systems. Of course, I can also enable it for SMP system in general. > Also, what's the difference between having both of them in the same > cacheline as opposed to a cacheline each? > They're both touched from the same tick, so it makes sense to have them > in one cacheline. Now you get to move two lines into exclusive state, > instead of just the one. Below is the performance data for different cacheline placements: Cacheline Placement | %CPU | JPM | ---------------------+-------+--------+ 2 separate cachelines| 0.55% | 405803 | 1 common cacheline | 1.01% | 403462 | 2nd change only | 1.06% | 403820 | Original code | 1.22% | 398509 | It seems like forcing the 2 fields to be in the same cacheline actually make it perform a little bit worse. It is likely that the 2 fields just happen to be in 2 different cachelines in x86. >> 2. Use atomic_long_add_return() to update the fields and save the >> returned value in a temporary location in the cfs structure to >> be used later instead of reading the fields directly. > Then why aren't this two patches? I will break it into 2 patches. > Furthermore, I completely hate the way you implemented this; the stuff > like in the first hunk below makes the entire code flow horrid. Its > already difficult code, using conditional variables makes it even worse. I can try to encapsulate the change in macros to not disrupt the current flow. > Who's to say your 'cached' value is recent? You didn't put in a call > chain analysis to show you always first pass through the add_return() > before using the cached value. Will provide a more detailed call chain analysis to show when and how the cache value is used. >> The second change does require some changes in the ordering of how >> some of the average counts are being computed and hence may have a >> slight effect on their behavior. > Might have is no good, either you work through it and make damn sure its > solid or you walk. I will do a more detailed analysis and provide that in the change log. > Preserved the rest for the added Cc's. > Will do. -Longman