public inbox for linux-kernel@vger.kernel.org
From: Jack Steiner <steiner@sgi.com>
To: William Lee Irwin III <wli@holomorphy.com>, linux-kernel@vger.kernel.org
Subject: Re: RCU scaling on large systems
Date: Fri, 7 May 2004 15:49:38 -0500	[thread overview]
Message-ID: <20040507204938.GA27428@sgi.com> (raw)
In-Reply-To: <20040501211704.GY1445@holomorphy.com>


On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
> 
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
> 
> The global lock is unfortunately still there.
> 
> 
> -- wli
> 
...
>  
> +#if RCU_CPU_SCATTER
> +#define rcu_need_switch(cpu)		(!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
> +#define rcu_clear_need_switch(cpu)	atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
> +static inline int rcu_any_cpu_need_switch(void)
> +{
> +	int cpu;
> +	for_each_online_cpu(cpu) {
> +		if (rcu_need_switch(cpu))
> +			return 1;
> +	}
> +	return 0;
> +}
..

This fixes only part of the problem.

Referencing percpu data for every cpu is not particularly efficient - at
least on our platform.  Percpu data is allocated so that it is on the
local node of each cpu.

We use 64MB granules.  The percpu data structures on the individual nodes
are separated by addresses that differ by many GB.  A scan of all percpu
data structures therefore requires a TLB entry for each node.  This is
costly and thrashes the TLB.  (Our max system size is currently 256 nodes.)


For example, a 4 node, 2 cpus/node system shows:

	mdb> m d __per_cpu_offset
        <0xa000000100b357c8> 0xe000003001010000
        <0xa000000100b357d0> 0xe000003001020000
        <0xa000000100b357d8> 0xe00000b001020000
        <0xa000000100b357e0> 0xe00000b001030000
        <0xa000000100b357e8> 0xe000013001030000
        <0xa000000100b357f0> 0xe000013001040000
        <0xa000000100b357f8> 0xe00001b001040000
        <0xa000000100b35800> 0xe00001b001050000

The node number is encoded in bits [48:39] of the virtual/physical
address.

Unfortunately, our hardware does not provide a way to allocate node
local memory for every node and have all the memory covered by a single
TLB entry.


Moving "need_switch" to a single array with cacheline-aligned
entries would work. I can give that a try.
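
A minimal user-space sketch of what I have in mind (the names and the
128-byte line size are placeholders, not kernel code):

```c
#include <stddef.h>

#define MAX_CPUS	512
#define CACHELINE_SIZE	128	/* assumed L3 line size; platform-dependent */

/* One flat array, one cacheline per cpu, instead of scattered per-node
 * percpu areas.  Padding keeps each cpu's flag on its own line, so a
 * cpu setting its flag does not bounce other cpus' lines. */
struct need_switch_entry {
	volatile int need_switch;
	char pad[CACHELINE_SIZE - sizeof(int)];
} __attribute__((aligned(CACHELINE_SIZE)));

static struct need_switch_entry rcu_need_switch_array[MAX_CPUS];

static void rcu_set_need_switch(int cpu)
{
	rcu_need_switch_array[cpu].need_switch = 1;
}

static void rcu_clear_need_switch(int cpu)
{
	rcu_need_switch_array[cpu].need_switch = 0;
}

/* The scan now touches one contiguous 64KB array (512 * 128 bytes),
 * coverable by a handful of TLB entries rather than one per node. */
static int rcu_any_cpu_need_switch(int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (rcu_need_switch_array[cpu].need_switch)
			return 1;
	return 0;
}
```

The tradeoff is that each cpu now writes to remote (node 0) memory when
setting its flag, but that is one write per quiescent state versus a
full cross-node TLB walk on every scan.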



	
-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



Thread overview: 22+ messages
2004-05-01 12:08 RCU scaling on large systems Jack Steiner
2004-05-01 21:17 ` William Lee Irwin III
2004-05-01 22:35   ` William Lee Irwin III
2004-05-02  1:38   ` Jack Steiner
2004-05-07 17:53   ` Andrea Arcangeli
2004-05-07 18:17     ` William Lee Irwin III
2004-05-07 19:59       ` Andrea Arcangeli
2004-05-07 20:49   ` Jack Steiner [this message]
2004-05-02 18:28 ` Paul E. McKenney
2004-05-03 16:39   ` Jesse Barnes
2004-05-03 20:04     ` Paul E. McKenney
2004-05-03 18:40   ` Jack Steiner
2004-05-07 20:50     ` Paul E. McKenney
2004-05-07 22:06       ` Jack Steiner
2004-05-07 23:32         ` Andrew Morton
2004-05-08  4:55           ` Jack Steiner
2004-05-17 21:18           ` Andrea Arcangeli
2004-05-17 21:42             ` Andrew Morton
2004-05-17 23:50               ` Andrea Arcangeli
2004-05-18 13:33               ` Jack Steiner
2004-05-18 23:13               ` Matt Mackall
  -- strict thread matches above, loose matches on Subject: below --
2004-05-20 11:36 Manfred Spraul
