From: Jack Steiner <steiner@sgi.com>
To: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU scaling on large systems
Date: Mon, 3 May 2004 13:40:06 -0500	[thread overview]
Message-ID: <20040503184006.GA10721@sgi.com> (raw)
In-Reply-To: <20040502182811.GA1244@us.ibm.com>

Paul - 

Thanks for the reply.


Additional data from experiments today.

As expected, there are multiple hot spots related to the rcu_ctrlblk.

	- scheduler_tick() in the rcu_pending macro. Specifically, on the
	  load of the rcu_cpu_mask.

	- rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex);

These two spots are ~equally hot.

Some of the cache line contention could be alleviated by separating 
these fields into multiple cache lines.  Wli posted a patch over the 
weekend that does that. I have not had a chance to review the patch in 
detail, but it looks like a reasonable idea.



-----------------
Response to your mail:



> From your numbers below, I would guess that if you have at least
> 8 CPUs per NUMA node, a two-level tree would suffice.  If you have
> only 4 CPUs per NUMA node, you might well need a per-node level,
> a per-4-nodes level, and a global level to get the global lock
> contention reduced sufficiently.

The system consists of 256 nodes. Each node has 2 cpus located on
a shared FSB. The nodes are packaged as 128 modules - 2 nodes per
module. The 2 nodes in a module are slightly "closer" latency-wise 
than nodes in different modules. 


> 
> > I also found an interesting anomaly that was traced to RCU. I have
> > a program that measures "cpu efficiency". Basically, the program 
> > creates a cpu bound thread for each cpu & measures the percentage 
> > of time that each cpu ACTUALLY spends executing user code.
> > On an idle system, each cpu *should* spend >99% of its time in user mode.
> > 
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
> 
> Again, no surprise, Linux's RCU was not designed for a 512-CPU
> system.  ;-)
> 
> The hierarchical grace-period-detection scheme described above
> also increases cache locality, greatly reducing the cache-thrashing
> you are seeing.
> 
> > Even more bizarre: if I repeatedly type "ls" in a *single* window 
> > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
> > of the time in the RCU code.
> 
> Hmmm...  How large was the directory you were running "ls" on?
> At first blush, it sounds like the "ls" was somehow provoking
> a dcache update, which would then exercise RCU.

The directory size does not seem to be significant. I tried one test 
on an NFS directory with 250 files and another test on /tmp with 25 
files. In both cases, the results were similar. 


> 
> > The RCU algorithms don't scale - at least on our systems!!!!!
> 
> As noted earlier, the current implementation is not designed for
> 512 CPUs.  And, as noted earlier, there are ways to make it
> scale.  But for some reason, we felt it advisable to start with
> a smaller, simpler, and hence less scalable implementation.  ;-)

Makes sense. I would not have expected otherwise. Overall, linux scales
to 512p much better than I would have predicted. 

Is anyone working on improving RCU scaling to higher cpu counts? I
don't want to duplicate any work that is already in progress.
Otherwise, I'll start investigating what can be done to improve
scaling. 




> 
> > Attached is an experimental hack that fixes the problem. I
> > don't believe that this is the correct fix but it does prove
> > that slowing down the frequency of updates fixes the problem.
> > 
> > 
> > With this hack, "ls" no longer measurably disturbs other cpus. Each
> > cpu spends ~99.8% of its time in user code regardless of the frequency
> > of typing "ls".
> > 
> > 
> > 
> > By default, the RCU code attempts to process callbacks on each cpu
> > every tick. The hack adds a mask so that only a few cpus process
> > callbacks each tick. 
> 
> Cute!  However, it is not clear to me that this approach is
> compatible with real-time use of RCU, since it results in CPUs
> processing their callbacks less frequently, and thus getting
> more of them to process at a time.
> 
> But it is not clear to me that anyone is looking for realtime
> response from a 512-CPU machine (yow!!!), so perhaps this
> is not a problem...

Agree on both counts.


-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.

Thread overview: 22+ messages
2004-05-01 12:08 RCU scaling on large systems Jack Steiner
2004-05-01 21:17 ` William Lee Irwin III
2004-05-01 22:35   ` William Lee Irwin III
2004-05-02  1:38   ` Jack Steiner
2004-05-07 17:53   ` Andrea Arcangeli
2004-05-07 18:17     ` William Lee Irwin III
2004-05-07 19:59       ` Andrea Arcangeli
2004-05-07 20:49   ` Jack Steiner
2004-05-02 18:28 ` Paul E. McKenney
2004-05-03 16:39   ` Jesse Barnes
2004-05-03 20:04     ` Paul E. McKenney
2004-05-03 18:40   ` Jack Steiner [this message]
2004-05-07 20:50     ` Paul E. McKenney
2004-05-07 22:06       ` Jack Steiner
2004-05-07 23:32         ` Andrew Morton
2004-05-08  4:55           ` Jack Steiner
2004-05-17 21:18           ` Andrea Arcangeli
2004-05-17 21:42             ` Andrew Morton
2004-05-17 23:50               ` Andrea Arcangeli
2004-05-18 13:33               ` Jack Steiner
2004-05-18 23:13               ` Matt Mackall
2004-05-20 11:36 Manfred Spraul
