* [patch] sched: align rq to cacheline boundary
From: Siddha, Suresh B @ 2007-04-09 18:08 UTC
To: akpm
Cc: mingo, nickpiggin, linux-kernel, suresh.b.siddha

Align the per cpu runqueue to the cacheline boundary. This will minimize the
number of cachelines touched during remote wakeup.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index b9a6837..eca33c5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,7 +278,7 @@ struct rq {
 	struct lock_class_key rq_lock_key;
 };
 
-static DEFINE_PER_CPU(struct rq, runqueues);
+static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
 
 static inline int cpu_of(struct rq *rq)
 {
* Re: [patch] sched: align rq to cacheline boundary
From: Andrew Morton @ 2007-04-09 20:40 UTC
To: Siddha, Suresh B
Cc: mingo, nickpiggin, linux-kernel, Andi Kleen, Ravikiran G Thirumalai

On Mon, 9 Apr 2007 11:08:53 -0700 "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will minimize
> the number of cachelines touched during remote wakeup.
>
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

Remember that this can consume up to ((linesize-4) * NR_CPUS) bytes, which
is rather a lot.

Remember also that the linesize on VSMP is 4k.

And that putting a gap in the per-cpu memory like this will reduce its
overall cache-friendliness.

Need more convincing, please.
* Re: [patch] sched: align rq to cacheline boundary
From: Ravikiran G Thirumalai @ 2007-04-09 21:53 UTC
To: Andrew Morton
Cc: Siddha, Suresh B, mingo, nickpiggin, linux-kernel, Andi Kleen

On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize-4) * NR_CPUS) bytes, which
> is rather a lot.
>
> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

The internode line size, yes. But Suresh is using
____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES (L1_CACHE_BYTES),
so this does not align the per-cpu variable to 4k. However, if the
motivation for this patch is a significant performance difference, then the
padding needs to be on the internode cacheline size, using
____cacheline_internodealigned_in_smp. That macro aligns a data structure
to the internode line size, which is 4k for vSMPowered systems and the L1
line size on all other architectures.

As for the ((linesize-4) * NR_CPUS) wastage, maybe we can place the
cacheline-aligned per-cpu data in another section, just as we do with the
.data.cacheline_aligned section, but keep this new section between
__per_cpu_start and __per_cpu_end?
* Re: [patch] sched: align rq to cacheline boundary
From: Siddha, Suresh B @ 2007-04-09 22:17 UTC
To: Ravikiran G Thirumalai
Cc: Andrew Morton, mingo, nickpiggin, linux-kernel, Andi Kleen

On Mon, Apr 09, 2007 at 02:53:09PM -0700, Ravikiran G Thirumalai wrote:
> On Mon, Apr 09, 2007 at 01:40:57PM -0700, Andrew Morton wrote:
> > Remember that this can consume up to ((linesize-4) * NR_CPUS) bytes,
> > which is rather a lot.

At least on x86_64, this depends on cpu_possible_map and not NR_CPUS.

> > Remember also that the linesize on VSMP is 4k.
> >
> > And that putting a gap in the per-cpu memory like this will reduce its
> > overall cache-friendliness.
>
> The internode line size, yes. But Suresh is using
> ____cacheline_aligned_in_smp, which uses SMP_CACHE_BYTES (L1_CACHE_BYTES),
> so this does not align the per-cpu variable to 4k. However, if the
> motivation for this patch is a significant performance difference, then
> the padding needs to be on the internode cacheline size, using
> ____cacheline_internodealigned_in_smp.

I see a 0.5% perf improvement on a database workload (which is a good
improvement for this workload). This patch minimizes the number of
cachelines touched during a remote task wakeup.

Kiran, can you educate me on when I am supposed to use
____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

> As for the ((linesize-4) * NR_CPUS) wastage, maybe we can place the
> cacheline-aligned per-cpu data in another section, just as we do with the
> .data.cacheline_aligned section, but keep this new section between
> __per_cpu_start and __per_cpu_end?

Yes. But that will still waste some memory in the new section if the data
elements are not multiples of 4k.

thanks,
suresh
* Re: [patch] sched: align rq to cacheline boundary
From: Ravikiran G Thirumalai @ 2007-04-10 5:00 UTC
To: Siddha, Suresh B
Cc: Andrew Morton, mingo, nickpiggin, linux-kernel, Andi Kleen

On Mon, Apr 09, 2007 at 03:17:05PM -0700, Siddha, Suresh B wrote:
> Kiran, can you educate me on when I am supposed to use
> ____cacheline_aligned_in_smp vs __cacheline_aligned_in_smp?

As far as my understanding goes, the four-underscore version is for
aligning members/elements within a data structure, and the two-underscore
version is for aligning statically defined variables. The two-underscore
version places the variable in a separate section meant for
cacheline-aligned variables, so that there is no false sharing on the
cacheline with a consecutive datum. For regular statically defined data
structures, the latter has to be used; but since your patch uses per-cpu
data, which is already in a separate section, you had to use the former, I
guess.

> > As for the ((linesize-4) * NR_CPUS) wastage, maybe we can place the
> > cacheline-aligned per-cpu data in another section, just as we do with
> > the .data.cacheline_aligned section.
>
> Yes. But that will still waste some memory in the new section if the data
> elements are not multiples of 4k.

Yes. But the wastage then depends on the data structure now being aligned,
rather than on whatever structure happened to sit next to it before. You
cannot avoid losing some memory to padding, I guess :). But padding
per-cpu data seems a bit odd, and I am not sure it is worth it for a 0.5%
gain.

Thanks,
Kiran
* Re: [patch] sched: align rq to cacheline boundary
From: Ingo Molnar @ 2007-04-10 6:36 UTC
To: Andrew Morton
Cc: Siddha, Suresh B, nickpiggin, linux-kernel, Andi Kleen, Ravikiran G Thirumalai

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize-4) * NR_CPUS) bytes,
> which is rather a lot.

yes - but one (special) issue here is that there are other 'hot' but truly
per-CPU structures nearby:

 ffffffff8067e800 D per_cpu__current_kprobe
 ffffffff8067e820 D per_cpu__kprobe_ctlblk
 ffffffff8067e960 D per_cpu__mmu_gathers
 ffffffff8067f960 d per_cpu__runqueues
 ffffffff80680c60 d per_cpu__cpu_domains
 ffffffff80680df0 d per_cpu__sched_group_cpus

cpu_domains is being dirtied too (sd->nr_balance_failed, sd->last_balance,
etc.), and mmu_gathers too. So while both mmu_gathers and cpu_domains are
mostly purely per-CPU, runqueue fields can bounce around a lot and drag
those nearby fields with them (and then get dragged back due to those
nearby fields being used per-CPU again).

the runqueue is really supposed to be cacheline-isolated at _both_ ends -
at its beginning and at its end as well.

> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

yes - although the per-cpu runqueue overhead is nearly 5K anyway.

> Remember also that the linesize on VSMP is 4k.

that sucks ...

maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
area that collects per-cpu fields that also have significant cross-CPU use
and need cacheline isolation? Such cacheline-aligned variables, if
collected separately, would pack up more tightly and would cause only half
of the wasted space.

	Ingo
* Re: [patch] sched: align rq to cacheline boundary
From: Siddha, Suresh B @ 2007-04-10 23:47 UTC
To: Ingo Molnar
Cc: Andrew Morton, nickpiggin, linux-kernel, Andi Kleen, Ravikiran G Thirumalai

On Tue, Apr 10, 2007 at 08:36:56AM +0200, Ingo Molnar wrote:
> the runqueue is really supposed to be cacheline-isolated at _both_ ends
> - at its beginning and at its end as well.

Then we either need to define the first element in the struct as cacheline
aligned, or move it into a section where all the elements are cacheline
aligned. My current patch will not align it at the end.

> > Remember also that the linesize on VSMP is 4k.
>
> that sucks ...
>
> maybe, to mitigate some of the costs, do a special PER_CPU_CACHE_ALIGNED
> area that collects per-cpu fields that also have significant cross-CPU
> use and need cacheline isolation? Such cacheline-aligned variables, if
> collected separately, would pack up more tightly and would cause only
> half of the wasted space.

I will post a patch (once Jeremy's page-align percpu patch goes into -mm).

thanks,
suresh
* Re: [patch] sched: align rq to cacheline boundary
From: Andi Kleen @ 2007-04-10 7:37 UTC
To: Andrew Morton
Cc: Siddha, Suresh B, mingo, nickpiggin, linux-kernel, Ravikiran G Thirumalai

> > -static DEFINE_PER_CPU(struct rq, runqueues);
> > +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
>
> Remember that this can consume up to ((linesize-4) * NR_CPUS) bytes,

On x86 it is just the real possible map now -- that tends to be much
smaller. There might be some other architectures which still allocate
per-cpu data for all of NR_CPUS (or always set the possible map to that),
but those should just be fixed.

> which is rather a lot.

We should have solved the problem of limited per-cpu space in .22 at
least, with some patches by Jeremy. I also plan a few other changes that
will use more per-CPU memory again.

> Remember also that the linesize on VSMP is 4k.
>
> And that putting a gap in the per-cpu memory like this will reduce its
> overall cache-friendliness.

When it avoids false sharing on remote wakeup, it should be more
cache-friendly.

> Need more convincing, please.

Was this based on some benchmark where it showed?

-Andi
* Re: [patch] sched: align rq to cacheline boundary
From: Ingo Molnar @ 2007-04-10 6:24 UTC
To: Siddha, Suresh B
Cc: akpm, nickpiggin, linux-kernel

* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> Align the per cpu runqueue to the cacheline boundary. This will
> minimize the number of cachelines touched during remote wakeup.
>
> -static DEFINE_PER_CPU(struct rq, runqueues);
> +static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;

ouch!! Now how did _that_ slip through. The runqueues had been
cacheline-aligned for ages. Or at least, they were supposed to be.

could you see any improvement in profiles or workloads with this patch
applied? (just curious - it's an obviously right fix)

> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo
* Re: [patch] sched: align rq to cacheline boundary
From: Siddha, Suresh B @ 2007-04-10 16:40 UTC
To: Ingo Molnar
Cc: akpm, nickpiggin, linux-kernel

On Tue, Apr 10, 2007 at 08:24:22AM +0200, Ingo Molnar wrote:
> ouch!! Now how did _that_ slip through. The runqueues had been
> cacheline-aligned for ages. Or at least, they were supposed to be.

Perhaps the per_cpu definition gave the impression of being cacheline
aligned, too.

> could you see any improvement in profiles or workloads with this patch
> applied? (just curious - it's an obviously right fix)

We have seen a 0.5% perf improvement on a database workload on a 2-node
setup. 0.5% is a very good improvement for this workload.

thanks,
suresh