Date: Mon, 3 May 2004 13:40:06 -0500
From: Jack Steiner
To: "Paul E. McKenney"
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU scaling on large systems
Message-ID: <20040503184006.GA10721@sgi.com>
References: <20040501120805.GA7767@sgi.com> <20040502182811.GA1244@us.ibm.com>
In-Reply-To: <20040502182811.GA1244@us.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Paul -

Thanks for the reply. Additional data from experiments today.

As expected, there are multiple hot spots related to the rcu_ctrlblk:

	- scheduler_tick(), in the rcu_pending macro - specifically,
	  on the load of the rcu_cpu_mask.

	- rcu_check_quiescent_state(), on spin_lock(&rcu_ctrlblk.mutex).

These two spots are ~equally hot. Some of the cache line contention
could be alleviated by separating these fields into multiple cache
lines. Wli posted a patch over the weekend that does that. I have not
had a chance to review the patch in detail, but it looks like a
reasonable idea.

-----------------
Response to your mail:

> From your numbers below, I would guess that if you have at least
> 8 CPUs per NUMA node, a two-level tree would suffice. If you have
> only 4 CPUs per NUMA node, you might well need a per-node level,
> a per-4-nodes level, and a global level to get the global lock
> contention reduced sufficiently.

The system consists of 256 nodes.
Each node has 2 cpus located on a shared FSB. The nodes are packaged
as 128 modules - 2 nodes per module. The 2 nodes in a module are
slightly "closer" latency-wise than nodes in different modules.

> > I also found an interesting anomaly that was traced to RCU. I have
> > a program that measures "cpu efficiency". Basically, the program
> > creates a cpu-bound thread for each cpu & measures the percentage
> > of time that each cpu ACTUALLY spends executing user code.
> > On an idle system, each cpu *should* spend >99% in user mode.
> >
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in
> > the kernel RCU code. The time is spent contending for shared cache
> > lines.
>
> Again, no surprise, Linux's RCU was not designed for a 512-CPU
> system. ;-)
>
> The hierarchical grace-period-detection scheme described above
> also increases cache locality, greatly reducing the cache-thrashing
> you are seeing.
>
> > Even more bizarre: if I repeatedly type "ls" in a *single* window
> > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
> > of the time in the RCU code.
>
> Hmmm... How large was the directory you were running "ls" on?
> At first blush, it sounds like the "ls" was somehow provoking
> a dcache update, which would then exercise RCU.

The directory size does not seem to be too significant. I tried one
test on an NFS directory with 250 files, and another test on /tmp
with 25 files. In both cases, the results were similar.

> > The RCU algorithms don't scale - at least on our systems!!!!!
>
> As noted earlier, the current implementation is not designed for
> 512 CPUs. And, as noted earlier, there are ways to make it
> scale. But for some reason, we felt it advisable to start with
> a smaller, simpler, and hence less scalable implementation. ;-)

Makes sense. I would not have expected otherwise. Overall, linux
scales to 512p much better than I would have predicted.

Is anyone working on improving RCU scaling to higher cpu counts?
I don't want to duplicate any work that is already in progress.
Otherwise, I'll start investigating what can be done to improve
scaling.

> > Attached is an experimental hack that fixes the problem. I
> > don't believe that this is the correct fix, but it does prove
> > that slowing down the frequency of updates fixes the problem.
> >
> > With this hack, "ls" no longer measurably disturbs other cpus.
> > Each cpu spends ~99.8% of its time in user code regardless of
> > the frequency of typing "ls".
> >
> > By default, the RCU code attempts to process callbacks on each
> > cpu every tick. The hack adds a mask so that only a few cpus
> > process callbacks each tick.
>
> Cute! However, it is not clear to me that this approach is
> compatible with real-time use of RCU, since it results in CPUs
> processing their callbacks less frequently, and thus getting
> more of them to process at a time.
>
> But it is not clear to me that anyone is looking for realtime
> response from a 512-CPU machine (yow!!!), so perhaps this
> is not a problem...

Agree on both counts.

--
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.