* [patch] sched: align rq to cacheline boundary
@ 2007-04-09 18:08 Siddha, Suresh B
2007-04-09 20:40 ` Andrew Morton
2007-04-10 6:24 ` Ingo Molnar
0 siblings, 2 replies; 10+ messages in thread
From: Siddha, Suresh B @ 2007-04-09 18:08 UTC (permalink / raw)
To: akpm; +Cc: mingo, nickpiggin, linux-kernel, suresh.b.siddha
Align the per cpu runqueue to the cacheline boundary. This will minimize the
number of cachelines touched during remote wakeup.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index b9a6837..eca33c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,7 +278,7 @@ struct rq {
struct lock_class_key rq_lock_key;
};
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
static inline int cpu_of(struct rq *rq)
{
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 18:08 [patch] sched: align rq to cacheline boundary Siddha, Suresh B
@ 2007-04-09 20:40 ` Andrew Morton
2007-04-09 21:53 ` Ravikiran G Thirumalai
` (2 more replies)
2007-04-10 6:24 ` Ingo Molnar
1 sibling, 3 replies; 10+ messages in thread
From: Andrew Morton @ 2007-04-09 20:40 UTC (permalink / raw)
To: Siddha, Suresh B
Cc: mingo, nickpiggin, linux-kernel, Andi Kleen,
Ravikiran G Thirumalai
On Mon, 9 Apr 2007 11:08:53 -0700
"Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
> Align the per cpu runqueue to the cacheline boundary. This will minimize the
> number of cachelines touched during remote wakeup.
>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b9a6837..eca33c5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -278,7 +278,7 @@ struct rq {
> struct lock_class_key rq_lock_key;
> };
>
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
rather a lot.
Remember also that the linesize on VSMP is 4k.
And that putting a gap in the per-cpu memory like this will reduce its
overall cache-friendliness.
Need more convincing, please.
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 20:40 ` Andrew Morton
@ 2007-04-09 21:53 ` Ravikiran G Thirumalai
2007-04-09 22:17 ` Siddha, Suresh B
2007-04-10 6:36 ` Ingo Molnar
2007-04-10 7:37 ` Andi Kleen
2 siblings, 1 reply; 10+ messages in thread
From: Ravikiran G Thirumalai @ 2007-04-09 21:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Siddha, Suresh B, mingo, nickpiggin, linux-kernel, Andi Kleen
On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 11:08:53 -0700
> "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>
> > Align the per cpu runqueue to the cacheline boundary. This will minimize the
> > number of cachelines touched during remote wakeup.
> >
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > ---
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index b9a6837..eca33c5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -278,7 +278,7 @@ struct rq {
> > struct lock_class_key rq_lock_key;
> > };
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> rather a lot.
>
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.
>
The internode line size, yes. But Suresh is using ____cacheline_aligned_in_smp,
which uses SMP_CACHE_BYTES (L1_CACHE_BYTES), so this does not align the
per-cpu variable to 4k. However, if the motivation for this patch is a
significant performance difference, then the padding needs to be to the
internode cacheline size, using ____cacheline_internodealigned_in_smp.
____cacheline_internodealigned_in_smp aligns a data structure to the
internode line size, which is 4k for vSMPowered systems and the L1 line size
on all other architectures.
As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline
aligned per-cpu data in another section, just like we do with
.data.cacheline_aligned section, but keep this new section between
__percpu_start and __percpu_end?
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 21:53 ` Ravikiran G Thirumalai
@ 2007-04-09 22:17 ` Siddha, Suresh B
2007-04-10 5:00 ` Ravikiran G Thirumalai
0 siblings, 1 reply; 10+ messages in thread
From: Siddha, Suresh B @ 2007-04-09 22:17 UTC (permalink / raw)
To: Ravikiran G Thirumalai
Cc: Andrew Morton, Siddha, Suresh B, mingo, nickpiggin, linux-kernel,
Andi Kleen
On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > On Mon, 9 Apr 2007 11:08:53 -0700
> > "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
> > > -static DEFINE_PER_CPU(struct rq, runqueues);
> > > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
> >
> > Remember that this can consume up to (linesize-4 * NR_CPUS) bytes, which is
> > rather a lot.
At least on x86_64, this depends on cpu_possible_map and not NR_CPUS.
> >
> > Remember also that the linesize on VSMP is 4k.
> >
> > And that putting a gap in the per-cpu memory like this will reduce its
> > overall cache-friendliness.
> >
>
> The internode line size yes. But Suresh is using ____cacheline_aligned_in_smp,
> which uses SMP_CACHE_BYTES (L1_CACHE_BYTES). So this does not align the
> per-cpu variable to 4k. However, if the motivation for this patch was
> significant performance difference, then, the above padding needs to be on
> the internode cacheline size using ____cacheline_internodealigned_in_smp.
I see a 0.5% perf improvement on a database workload (which is a good
improvement for this workload). This patch minimizes the number of cache
lines touched during a remote task wakeup.
Kiran, can you educate me on when I am supposed to use
____cacheline_aligned_in_smp
vs
__cacheline_aligned_in_smp?
> As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline
> aligned per-cpu data in another section, just like we do with
> .data.cacheline_aligned section, but keep this new section between
> __percpu_start and __percpu_end?
Yes. But that will still waste some memory in the new section, if the data
elements are not multiples of 4k.
thanks,
suresh
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 22:17 ` Siddha, Suresh B
@ 2007-04-10 5:00 ` Ravikiran G Thirumalai
0 siblings, 0 replies; 10+ messages in thread
From: Ravikiran G Thirumalai @ 2007-04-10 5:00 UTC (permalink / raw)
To: Siddha, Suresh B
Cc: Andrew Morton, mingo, nickpiggin, linux-kernel, Andi Kleen
On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> > On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > > On Mon, 9 Apr 2007 11:08:53 -0700
> > > "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>
> Kiran, can you educate me when I am supposed to use
> ____cacheline_aligned_in_smp
> Vs
> __cacheline_aligned_in_smp ?
As far as my understanding goes, the four-underscore version is for
aligning members/elements within a data structure, and the two-underscore
version is for aligning statically defined variables.
The two-underscore version also places the variable in a separate section
meant for cacheline-aligned variables, so that there is no false sharing of
a cacheline with a consecutive datum. For regular statically defined data
structures the two-underscore version has to be used, but since your patch
uses per-cpu data, which is already in a separate section, you had to use
the four-underscore one, I guess.
>
> > As for the (linesize-4 * NR_CPUS) wastage, maybe we can place the cacheline
> > aligned per-cpu data in another section, just like we do with
> > .data.cacheline_aligned section, but keep this new section between
> > __percpu_start and __percpu_end?
>
> Yes. But that will still waste some memory in the new section, if the data
> elements are not multiples of 4k.
Yes. But the wastage now depends on the data structure being aligned, rather
than on whatever structure happened to sit next to it before. You cannot
avoid losing some memory to padding, I guess :). But padding per-cpu data
seems a bit odd, and I am not sure it is worth it for a 0.5% gain.
Thanks,
Kiran
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 18:08 [patch] sched: align rq to cacheline boundary Siddha, Suresh B
2007-04-09 20:40 ` Andrew Morton
@ 2007-04-10 6:24 ` Ingo Molnar
2007-04-10 16:40 ` Siddha, Suresh B
1 sibling, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2007-04-10 6:24 UTC (permalink / raw)
To: Siddha, Suresh B; +Cc: akpm, nickpiggin, linux-kernel
* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:
> Align the per cpu runqueue to the cacheline boundary. This will
> minimize the number of cachelines touched during remote wakeup.
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
ouch!! Now how did _that_ slip through. The runqueues had been
cacheline-aligned for ages. Or at least, they were supposed to be.
could you see any improvement in profiles or workloads with this patch
applied? (just curious - it's an obviously right fix)
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Ingo
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 20:40 ` Andrew Morton
2007-04-09 21:53 ` Ravikiran G Thirumalai
@ 2007-04-10 6:36 ` Ingo Molnar
2007-04-10 23:47 ` Siddha, Suresh B
2007-04-10 7:37 ` Andi Kleen
2 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2007-04-10 6:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Siddha, Suresh B, nickpiggin, linux-kernel, Andi Kleen,
Ravikiran G Thirumalai
* Andrew Morton <akpm@linux-foundation.org> wrote:
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes,
> which is rather a lot.
yes - but one (special) issue here is that there are other 'hot' but
truly per-CPU structures nearby:
ffffffff8067e800 D per_cpu__current_kprobe
ffffffff8067e820 D per_cpu__kprobe_ctlblk
ffffffff8067e960 D per_cpu__mmu_gathers
ffffffff8067f960 d per_cpu__runqueues
ffffffff80680c60 d per_cpu__cpu_domains
ffffffff80680df0 d per_cpu__sched_group_cpus
cpu_domains is being dirtied too (sd->nr_balance_failed,
sd->last_balance, etc.), and mmu_gathers too. So while both mmu_gathers
and cpu_domains are mostly purely per-CPU, runqueue fields can bounce
around a lot and drag those nearby fields with them (and then get dragged
back when those nearby fields are used per-CPU again.)
the runqueue is really supposed to be cacheline-isolated at _both_ ends
- at its beginning and at its end as well.
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.
yes - although the per-cpu runqueue overhead is nearly 5K anyway.
> Remember also that the linesize on VSMP is 4k.
that sucks ...
maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
area that collects per-cpu fields that also have significant cross-CPU
use and need cacheline isolation? Such cacheline-aligned variables, if
collected separately, would pack up more tightly and would cause only
half of the wasted space.
Ingo
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [patch] sched: align rq to cacheline boundary
2007-04-09 20:40 ` Andrew Morton
2007-04-09 21:53 ` Ravikiran G Thirumalai
2007-04-10 6:36 ` Ingo Molnar
@ 2007-04-10 7:37 ` Andi Kleen
2 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2007-04-10 7:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Siddha, Suresh B, mingo, nickpiggin, linux-kernel,
Ravikiran G Thirumalai
> >
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to (linesize-4 * NR_CPUS) bytes,
On x86 just the real possible map now -- that tends to be much smaller.
There might be some other architectures that still allocate per-cpu data
for all of NR_CPUS (or always set the possible map to that), but those
should just be fixed.
> which is
> rather a lot.
We should have solved the problem of limited per-cpu space in .22 at least,
with some patches by Jeremy. I also plan a few other changes that will
use more per-CPU memory again.
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.
When he avoids false sharing on remote wakeup it should be more cache friendly.
> Need more convincing, please.
Was this based on some benchmark where it showed?
-Andi
* Re: [patch] sched: align rq to cacheline boundary
2007-04-10 6:24 ` Ingo Molnar
@ 2007-04-10 16:40 ` Siddha, Suresh B
0 siblings, 0 replies; 10+ messages in thread
From: Siddha, Suresh B @ 2007-04-10 16:40 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Siddha, Suresh B, akpm, nickpiggin, linux-kernel
On Tue, Apr 10, 2007 at 08:24:22AM +0200, Ingo Molnar wrote:
>
> * Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:
>
> > Align the per cpu runqueue to the cacheline boundary. This will
> > minimize the number of cachelines touched during remote wakeup.
>
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> ouch!! Now how did _that_ slip through. The runqueues had been
> cacheline-aligned for ages. Or at least, they were supposed to be.
perhaps the per_cpu definition gave the impression of being cacheline aligned too..
>
> could you see any improvement in profiles or workloads with this patch
> applied? (just curious - it's an obviously right fix)
We have seen a 0.5% perf improvement on a database workload on a 2-node setup.
0.5% is a very good improvement for this workload.
thanks,
suresh
* Re: [patch] sched: align rq to cacheline boundary
2007-04-10 6:36 ` Ingo Molnar
@ 2007-04-10 23:47 ` Siddha, Suresh B
0 siblings, 0 replies; 10+ messages in thread
From: Siddha, Suresh B @ 2007-04-10 23:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Siddha, Suresh B, nickpiggin, linux-kernel,
Andi Kleen, Ravikiran G Thirumalai
On Tue, Apr 10, 2007 at 08:36:56AM +0200, Ingo Molnar wrote:
> the runqueue is really supposed to be cacheline-isolated at _both_ ends
> - at its beginning and at its end as well.
Then either we need to define the first element in the struct as
cacheline aligned, or move it into a section where all the elements are
cacheline aligned. My current patch will not align it at the end.
> > Remember also that the linesize on VSMP is 4k.
>
> that sucks ...
>
> maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
> area that collects per-cpu fields that also have significant cross-CPU
> use and need cacheline isolation? Such cacheline-aligned variables, if
> collected separately, would pack up more tightly and would cause only
> half of the wasted space.
I will post a patch (once Jeremy's page-align percpu patch goes into -mm).
thanks,
suresh