From: "John Hawkes" <hawkes@sgi.com>
To: "Chen, Kenneth W" <kenneth.w.chen@intel.com>,
	Tony Luck <tony.luck@gmail.com>, Andrew Morton <akpm@osdl.org>,
	linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Jack Steiner <steiner@sgi.com>, Dan Higgins <djh@sgi.com>,
	John Hesterberg <jh@sgi.com>, Greg Edwards <edwardsg@sgi.com>
Subject: Re: [PATCH] ia64: change defconfig to NR_CPUS==1024
Date: Fri, 06 Jan 2006 17:06:18 +0000
Message-ID: <000701c612e3$8324eff0$6f00a8c0@comcast.net>
In-Reply-To: 200601052233.k05MX4g15045@unix-os.sc.intel.com

From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
> What type of heavy workloads have you measured? Including db transaction
> processing and decision making workloads?

I haven't used a db transaction processing benchmark, but I have used other
workloads with large process counts and high context-switch rates.

> > The potential
> > extra cachemiss seems to be lost in the noise.  The for_each_*cpu()
> > macros are relatively efficient in skipping past zeroed cpumask bits.
> > Workloads that impose higher loads on the CPU Scheduler tend to
> > bottleneck on non-Scheduler parts of the kernel, and it's the Scheduler
> > which makes the principal use of the cpumask_t, so these extra
> > cachemiss inefficiencies and extra CPU cycles to scan zero mask words
> > just get lost in the general system overhead.
>
> I found the above claims are generally false for workloads that put tons
> of pressure on the CPU cache, especially db workloads.  Typically
> for a db workload, the working set in user space is so large that making
> a trip into the kernel has a far larger secondary effect than the primary
> cache miss that occurred in the kernel.  In other words, cache lines evicted
> by the kernel code have a far larger impact on overall application
> performance and lead to lower overall system performance.  So
> when you say "get lost in the general system overhead", did you consider
> the secondary effect on application performance?

The current default is 512p, which is 8 64-bit words -- a cacheline.  Increasing
to 1024p adds another 8 words -- one more cacheline -- to the cpumask_t.  I
doubt you're going to see a performance regression on your db transaction
processing benchmark because of one additional cachemiss during active or
passive load-balancing.
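
To make the size arithmetic above concrete, here is a minimal userspace
sketch.  It assumes the mask is stored as an array of unsigned longs rounded
up to cover NR_CPUS bits; the SKETCH_* macros and the cpumask_words() and
cpumask_bytes() helpers are hypothetical names for illustration, not kernel
interfaces.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits in one mask word on the build machine (64 on an LP64 target). */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Round a bit count up to whole unsigned longs. */
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Hypothetical helper: number of words needed for an nr_cpus-bit mask. */
static size_t cpumask_words(size_t nr_cpus)
{
	return SKETCH_BITS_TO_LONGS(nr_cpus);
}

/* Hypothetical helper: storage in bytes for an nr_cpus-bit mask. */
static size_t cpumask_bytes(size_t nr_cpus)
{
	return cpumask_words(nr_cpus) * sizeof(unsigned long);
}

/* Doubling the CPU count from 512 to 1024 doubles the word count. */
static_assert(SKETCH_BITS_TO_LONGS(1024) == 2 * SKETCH_BITS_TO_LONGS(512),
	      "1024 bits need twice the words of 512 bits");
```

With 64-bit longs, cpumask_words(512) is 8 and cpumask_words(1024) is 16,
i.e. 64 bytes versus 128 bytes of mask.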

I agree that throughout the kernel we ought to be aware of increasing
cachemisses and the lengthening code paths, but I don't believe this
particular one is some evil that needs to be suppressed.  We have far more
micro-performance-impacting algorithms and data structures in the kernel right
now that we ought to consider -- e.g., cache coloring conflicts with the
struct runqueue -- as well as the obvious algorithm tweaks that greatly affect
processor assignments -- e.g., whether or not to call wake_idle().

> What we found is that going from NR_CPUS = 64 to 128 has a small performance
> impact on a db transaction processing workload.  Though I have not measured
> the difference between 128 and 1024.

Going from 64 (one word) to >64 (an array of words) produces a qualitative
change to the emitted code in how the cpumask_t is passed in calling sequences
and how it is manipulated.  I completely understand that you can detect a
small performance regression between 64 and 128.  I just don't believe you can
conclude that going from 512 to 1024 will exhibit a similar measurable
regression.
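
A rough sketch of that qualitative difference, under stated assumptions: the
mask64_t and mask1024_t types and the functions below are hypothetical
stand-ins, not the kernel's cpumask_t or its operations.  A one-word mask can
be passed and returned by value and combined with a single AND, while a
multi-word mask is passed by pointer and handled with a loop over its words.
Going from 8 words to 16 changes only the loop bound, not the shape of the
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WORDS_1024 16	/* 1024 bits / 64 bits per word */

typedef struct { uint64_t bits[1]; } mask64_t;
typedef struct { uint64_t bits[SKETCH_WORDS_1024]; } mask1024_t;

static_assert(sizeof(mask1024_t) == SKETCH_WORDS_1024 * sizeof(uint64_t),
	      "multi-word mask is a plain array of 64-bit words");

/* One word: passed and returned by value, a single AND. */
static mask64_t mask64_and(mask64_t a, mask64_t b)
{
	mask64_t r = { { a.bits[0] & b.bits[0] } };
	return r;
}

/* Many words: passed by pointer, combined word by word in a loop. */
static void mask1024_and(mask1024_t *dst, const mask1024_t *a,
			 const mask1024_t *b)
{
	for (size_t w = 0; w < SKETCH_WORDS_1024; w++)
		dst->bits[w] = a->bits[w] & b->bits[w];
}

/*
 * Index of the first set bit, or 1024 if the mask is empty.
 * Zero words are skipped with one compare each.
 */
static int mask1024_first(const mask1024_t *m)
{
	for (size_t w = 0; w < SKETCH_WORDS_1024; w++) {
		uint64_t v = m->bits[w];
		int bit = 0;

		if (v == 0)
			continue;
		while (!(v & 1)) {
			v >>= 1;
			bit++;
		}
		return (int)(w * 64) + bit;
	}
	return SKETCH_WORDS_1024 * 64;
}
```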

John Hawkes

