* [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 17:00 UTC
To: linux-kernel
Cc: Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage,
Eric W. Biederman, Linus Torvalds, Andrew Morton
For large cpu configurations, we find that the number of pids in a pidhash
bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.

This patch makes the upper limit scale based upon num_possible_cpus().
For machines with 128 cpus or fewer, the current upper limit of 12 is
maintained.
Signed-off-by: Robin Holt <holt@sgi.com>
Index: contention_unroll/kernel/pid.c
===================================================================
--- contention_unroll.orig/kernel/pid.c 2008-07-31 11:59:21.154284073 -0500
+++ contention_unroll/kernel/pid.c 2008-07-31 11:59:22.862497720 -0500
@@ -502,9 +502,10 @@ void __init pidhash_init(void)
 {
 	int i, pidhash_size;
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
+	int pidhash_shift_ul = max(12, fls(num_possible_cpus() - 1) + 5);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
-	pidhash_shift = min(12, pidhash_shift);
+	pidhash_shift = min(pidhash_shift_ul, pidhash_shift);
 	pidhash_size = 1 << pidhash_shift;
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
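[The proposed limit can be sketched outside the kernel as follows -- an illustrative Python stand-in for the C logic above, not kernel code, with fls() emulated via int.bit_length():]

```python
def fls(x):
    # Stand-in for the kernel's fls(): 1-based index of the most
    # significant set bit, with fls(0) == 0.
    return x.bit_length()

def pidhash_shift_upper_limit(num_possible_cpus):
    # Mirrors the patch: max(12, fls(num_possible_cpus() - 1) + 5)
    return max(12, fls(num_possible_cpus - 1) + 5)

# 128 cpus or fewer keep the old cap of 12; a 2048 cpu machine gets 16.
for cpus in (1, 128, 256, 2048):
    print(cpus, pidhash_shift_upper_limit(cpus))
```

[This reproduces the numbers in the changelog: the cap stays at 12 up to 128 cpus and reaches 16 at 2048 cpus.]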
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 18:35 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> For large cpu configurations, we find that the number of pids in a pidhash
> bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
> from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
>
> This patch makes the upper limit scale based upon num_possible_cpus().
> For machines with 128 cpus or fewer, the current upper limit of 12 is
> maintained.

It looks like there is a magic limit we are dancing around.

Can we please make the maximum for the hash table size be based on the
maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?

12 actually looks like it was fls(PID_MAX_LIMIT) - 3 at one point in
time.

Basing the hash table size on PID_MAX_LIMIT seems to say something
interesting about the maximum number of hash chain entries we will
tolerate in practice.
Eric

> Signed-off-by: Robin Holt <holt@sgi.com>
>
>
> Index: contention_unroll/kernel/pid.c
> ===================================================================
> --- contention_unroll.orig/kernel/pid.c	2008-07-31 11:59:21.154284073 -0500
> +++ contention_unroll/kernel/pid.c	2008-07-31 11:59:22.862497720 -0500
> @@ -502,9 +502,10 @@ void __init pidhash_init(void)
>  {
>  	int i, pidhash_size;
>  	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
> +	int pidhash_shift_ul = max(12, fls(num_possible_cpus() - 1) + 5);
>
>  	pidhash_shift = max(4, fls(megabytes * 4));
> -	pidhash_shift = min(12, pidhash_shift);
> +	pidhash_shift = min(pidhash_shift_ul, pidhash_shift);
>  	pidhash_size = 1 << pidhash_shift;
>
>  	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 19:32 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Thu, Jul 31, 2008 at 11:35:19AM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > For large cpu configurations, we find that the number of pids in a pidhash
> > bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
> > from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
> >
> > This patch makes the upper limit scale based upon num_possible_cpus().
> > For machines with 128 cpus or fewer, the current upper limit of 12 is
> > maintained.
>
> It looks like there is a magic limit we are dancing around.
>
> Can we please make the maximum for the hash table size be based on the
> maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?

I am happy to base it upon whatever you think is correct.  So long as it
goes up for machines with lots of cpus, that will satisfy me.  It is
probably as much a problem on smaller machines, but if you have _THAT_
many pids in use, you are probably oversubscribing many other resources
and don't really care.  That limit will essentially become a constant
(the compiler may even do that for us, but I have not checked any arch
other than ia64).  Should I just replace the 12 with a 16 or 17 or some
new magic number?

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 19:49 UTC
To: Robin Holt
Cc: Eric W. Biederman, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> On Thu, Jul 31, 2008 at 11:35:19AM -0700, Eric W. Biederman wrote:
>> Robin Holt <holt@sgi.com> writes:
>>
>> > For large cpu configurations, we find that the number of pids in a pidhash
>> > bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
>> > from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
>> >
>> > This patch makes the upper limit scale based upon num_possible_cpus().
>> > For machines with 128 cpus or fewer, the current upper limit of 12 is
>> > maintained.
>>
>> It looks like there is a magic limit we are dancing around.
>>
>> Can we please make the maximum for the hash table size be based on the
>> maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?
>
> I am happy to base it upon whatever you think is correct.  So long as it
> goes up for machines with lots of cpus, that will satisfy me.  It is
> probably as much a problem on smaller machines, but if you have _THAT_
> many pids in use, you are probably oversubscribing many other resources
> and don't really care.  That limit will essentially become a constant
> (the compiler may even do that for us, but I have not checked any arch
> other than ia64).  Should I just replace the 12 with a 16 or 17 or some
> new magic number?

I like setting the limit as a maximum hash chain length, which is what
fls(PID_MAX_LIMIT) - X is.  X is the maximum hash chain length you can
tolerate.

Eric
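[The chain-length reading of the constant can be sketched as follows -- a Python illustration, where PID_MAX_LIMIT is taken as the common 32768 default, an assumption about the configuration rather than a value from this thread:]

```python
def fls(x):
    # 1-based index of the most significant set bit; fls(0) == 0.
    return x.bit_length()

def full_table_chain_length(pid_max_limit, x):
    # Average pids per bucket when the table holds PID_MAX_LIMIT pids
    # and pidhash_shift = fls(PID_MAX_LIMIT) - X.
    shift = fls(pid_max_limit) - x
    return pid_max_limit / (1 << shift)

# With PID_MAX_LIMIT a power of two, the result is 2**(X - 1), so the
# constant X really is a statement about tolerated chain length:
PID_MAX_LIMIT = 32768  # common default; assumed here for illustration
print(full_table_chain_length(PID_MAX_LIMIT, 6))  # chains of 32 at the limit
print(full_table_chain_length(PID_MAX_LIMIT, 3))  # chains of 4 at the limit
```

[Note the same 2**(X - 1) bound holds for any power-of-two PID_MAX_LIMIT, including the larger values possible on 64-bit configurations.]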
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 20:08 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Like so???

I have not tested this yet.

---

Set the upper limit for the pidhash_shift based upon PID_MAX_LIMIT.

Signed-off-by: Robin Holt <holt@sgi.com>

Index: contention_unroll/kernel/pid.c
===================================================================
--- contention_unroll.orig/kernel/pid.c	2008-07-31 11:59:21.154284073 -0500
+++ contention_unroll/kernel/pid.c	2008-07-31 15:07:21.909158788 -0500
@@ -504,7 +504,7 @@ void __init pidhash_init(void)
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
-	pidhash_shift = min(12, pidhash_shift);
+	pidhash_shift = min(fls(PID_MAX_LIMIT) - 6, pidhash_shift);
 	pidhash_size = 1 << pidhash_shift;
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 22:04 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> Like so???
>
> I have not tested this yet.

Looks reasonable to me.

In what circumstances was the lookup in the pid hash table with long
chains causing a performance slowdown?  We don't perform a lot of
lookups.

> -	pidhash_shift = min(12, pidhash_shift);
> +	pidhash_shift = min(fls(PID_MAX_LIMIT) - 6, pidhash_shift);
>  	pidhash_size = 1 << pidhash_shift;
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-01 12:04 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > Like so???
> >
> > I have not tested this yet.
>
> Looks reasonable to me.
>
> In what circumstances was the lookup in the pid hash table with long
> chains causing a performance slowdown?  We don't perform a lot of
> lookups.

It was initially detected while profiling 'ps' on a 2048p machine that
had 13 kernel threads per cpu.  We added a couple more device drivers
which added additional threads.  We then started a pthread-on-process
MPI job which had 2048 ranks, each with 4 threads (a test case from a
customer job).  There were misc other processes out there which brought
our task count up to approx 63k.  A larger page size helped the problem
(we went from 16k to 64k).

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-01 18:27 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
>> Robin Holt <holt@sgi.com> writes:
>>
>> > Like so???
>> >
>> > I have not tested this yet.
>>
>> Looks reasonable to me.
>>
>> In what circumstances was the lookup in the pid hash table with long
>> chains causing a performance slowdown?  We don't perform a lot of
>> lookups.
>
> It was initially detected while profiling 'ps' on a 2048p machine that
> had 13 kernel threads per cpu.  We added a couple more device drivers
> which added additional threads.  We then started a pthread-on-process
> MPI job which had 2048 ranks, each with 4 threads (a test case from a
> customer job).  There were misc other processes out there which brought
> our task count up to approx 63k.  A larger page size helped the problem
> (we went from 16k to 64k).

Large page size?  Do you mean larger hash size?

What were you measuring that showed improvement with the large hash
size?

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-01 19:13 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Fri, Aug 01, 2008 at 11:27:20AM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
> >> Robin Holt <holt@sgi.com> writes:
> >>
> >> > Like so???
> >> >
> >> > I have not tested this yet.
> >>
> >> Looks reasonable to me.
> >>
> >> In what circumstances was the lookup in the pid hash table with long
> >> chains causing a performance slowdown?  We don't perform a lot of
> >> lookups.
> >
> > It was initially detected while profiling 'ps' on a 2048p machine that
> > had 13 kernel threads per cpu.  We added a couple more device drivers
> > which added additional threads.  We then started a pthread-on-process
> > MPI job which had 2048 ranks, each with 4 threads (a test case from a
> > customer job).  There were misc other processes out there which brought
> > our task count up to approx 63k.  A larger page size helped the problem
> > (we went from 16k to 64k).
>
> Large page size?  Do you mean larger hash size?
>
> What were you measuring that showed improvement with the large hash
> size?

Oops, confusing details.  That was a different problem we had been
tracking.

Sorry for the confusion,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-01 19:59 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> Oops, confusing details.  That was a different problem we had been
> tracking.

Which leads back to the original question.  What were you measuring
that showed improvement with a larger pid hash size?

Almost by definition a larger hash table will perform better.  However
my intuition is that we are talking about something that should be in
the noise for most workloads.

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Stephen Champion @ 2008-08-04 13:11 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>> Oops, confusing details.  That was a different problem we had been
>> tracking.
>
> Which leads back to the original question.  What were you measuring
> that showed improvement with a larger pid hash size?
>
> Almost by definition a larger hash table will perform better.  However
> my intuition is that we are talking about something that should be in
> the noise for most workloads.

Robin asked me to chime in on this, as I did the early "look at that"
work and suggested it to Robin.

I noticed the potential for increasing pidhash_shift while chasing down
a patch to our kernel (2.6.16 stable based) which had proc_pid_readdir()
calling find_pid() for init_task through the highest pid #.  This patch
caused a rather serious problem on a 2048 core Altix.  Before
identifying the culprit, I increased pidhash_shift.  This made a *huge*
difference: enough to get the box marginally functional while I tracked
down the origins of the problem.

After backing out the problematic patch, I took a look at pidhash_shift
in normal circumstances:

With pidhash_shift == 12, running only a few common services and
monitoring tools (sendmail, nagios, etc.) for ~28k active processes,
mostly of the kernel variety, the 20 cpu boot cpuset we use on that
system to confine normal system processes and interactive logins was
spending >1% of its time in find_pid(), and an 'ls /proc > /dev/null'
took >0.4s.  With pidhash_shift == 16, the timing went to <0.2s, and the
total time spent in find_pid() was reduced to noise level.

In addition to raising the limit on larger systems, it looked reasonable
to scale the pid hash with the # of processors instead of memory.  While
I observed variably high process:cpu ratios on small systems (2c - 32c),
they also have relatively few processes.  The 192c - 2048c systems I was
able to look at were all hovering at 13 +/- 2 processes per cpu, even
with wildly varying memory sizes.

Despite more recent changes in proc_pid_readdir, my results should apply
to current source.  It looks like both the old 2.6.16 implementation and
the current version will call find_pid (or equivalent) once for each
successive getdents() call on /proc, excepting when the cursor is on the
first entry.  A quick look, and we have 88 getdents64() calls for both
'ps' and 'ls /proc' with 29k processes running, which appears to be the
primary source of calls.

It's not giganormous, although I probably could come up with a pointless
microbenchmark to show it's 300% better.  Importantly, it does
noticeably improve normal interactive tools like 'ps' and 'top', and a
performance visualization tool developed by a customer (nodemon)
refreshes faster.  For a 512k init allocation, that seems like a very
good deal.

I'd like to lose 20,000 kernel processes in addition to growing the pid
hash!
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-04 20:36 UTC
To: Stephen Champion
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Stephen Champion <schamp@sgi.com> writes:

> Despite more recent changes in proc_pid_readdir, my results should apply
> to current source.  It looks like both the old 2.6.16 implementation and
> the current version will call find_pid (or equivalent) once for each
> successive getdents() call on /proc, excepting when the cursor is on the
> first entry.  A quick look, and we have 88 getdents64() calls for both
> 'ps' and 'ls /proc' with 29k processes running, which appears to be the
> primary source of calls.

Yes.  proc_pid_readdir->next_tgid->find_ge_pid does a hash table lookup.

> It's not giganormous, although I probably could come up with a pointless
> microbenchmark to show it's 300% better.  Importantly, it does
> noticeably improve normal interactive tools like 'ps' and 'top', and a
> performance visualization tool developed by a customer (nodemon)
> refreshes faster.  For a 512k init allocation, that seems like a very
> good deal.

Alright.  The fact that it really is tools like ps and top that are proc
intensive shapes my opinion somewhat.  It is doing lots and lots of hash
table lookups so we can see each process that is expensive, rather than
simple one-at-a-time lookups.

Hash table sizing gets us into a guessing game of how many pids we
expect to deal with on this workload.  And we guess by either memory
(how many hash table entries can we afford) or cpu (how many hash table
entries can we expect).

There is an alternative to growing the hash table that I was playing
with at one time.  Use a radix tree that has essentially the same
properties for insert and delete as today's hash table: rcu for lookup
and a lock (spin lock?) for insert (memory allocation for the
intermediate nodes is the tricky bit).  Because pids are naturally dense
we can accelerate the /proc accesses by walking the radix tree, and with
a good implementation we only spend as much memory as we have pids to
worry about.  This is particularly appealing to me in the context of pid
namespaces.

I did not pursue it at the time I thought of it because I could not
think of a real workload that had enough pids to make the performance of
the current pid hash be noticeable.  For our normal workload today my
rough numbers said the pid hash is optimal.

The interesting twist to this tale is that you have 29k processes, and
20k of them are kernel threads.  Which seems to say that it isn't user
space that is putting a heavy workload on the pid infrastructure but all
of the per cpu kernel threads.

At only 9k processes we will have an average hash chain length of 2.25,
which should perform well, as we have a good hash function and hash
chain lengths cluster well.  This is as opposed to the average hash
chain length of 7.25 which you are seeing now.

It looks to me like a cap of 64k entries is excessive; we could place it
down at 16k (pid_hash_shift = 14) and your machines should perform well
in ps and top related activities, and it would only cost 128k.  If you
killed the excess kernel processes you could live very happily with an
8k table (pid_hash_shift = 13).

If we want something that tunes to the workload I expect a radix tree
would be best.  If the goal after 4k cpus is 64k cpus, which I heard
someone mention, I think that is what you really want: a design that
scales to the workload on the system.

Eric
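[The chain-length figures quoted above work out as follows -- a quick Python sketch, where "9k" and "29k" are taken as 9216 and 29696, the counts that yield exactly 2.25 and 7.25 in a 4096-entry (shift 12) table:]

```python
def avg_chain_length(num_pids, pidhash_shift):
    # Expected pids per bucket, assuming the hash spreads entries evenly.
    return num_pids / (1 << pidhash_shift)

print(avg_chain_length(9216, 12))    # 2.25 with today's 4096-entry table
print(avg_chain_length(29696, 12))   # 7.25 as observed with ~29k tasks
print(avg_chain_length(29696, 14))   # ~1.81 with the suggested 16k entries
```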
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-04 23:58 UTC
To: Eric W. Biederman
Cc: Stephen Champion, Robin Holt, linux-kernel, Pavel Emelyanov,
	Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage, Linus Torvalds,
	Andrew Morton

On Mon, Aug 04, 2008 at 01:36:38PM -0700, Eric W. Biederman wrote:
> Stephen Champion <schamp@sgi.com> writes:
>
> If we want something that tunes to the workload I expect a radix tree
> would be best.  If the goal after 4k cpus is 64k cpus, which I heard
> someone mention, I think that is what you really want: a design that
> scales to the workload on the system.

But if we simply scale based upon num_possible_cpus(), we get a
relatively representative scaling function.  Usually, customers buy
machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
512k, or even 1m to be acceptable at this size of machine.

For 2.6.27, would you accept an upper cap based on the memory size
algorithm you have now and adjusted for num_possible_cpus()?
Essentially the first patch I posted.

I would like to try and not be responsible for the radix tree
implementation as I have other more pressing obligations.  If, however,
it was a condition of getting an interim solution into 2.6.27, I would
be willing to discuss this with my management.

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-05 0:38 UTC
To: Robin Holt
Cc: Stephen Champion, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> But if we simply scale based upon num_possible_cpus(), we get a
> relatively representative scaling function.  Usually, customers buy
> machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
> 512k, or even 1m to be acceptable at this size of machine.

For your customers, and your kernel thread workload, you get a
reasonable representation.  For other different people and different
workloads you don't.  I happen to know of a completely different class
of workload that can do better.

> For 2.6.27, would you accept an upper cap based on the memory size
> algorithm you have now and adjusted for num_possible_cpus()?
> Essentially the first patch I posted.

I want to throw a screaming hissy fit.

The merge window has closed.  This is not a bug.  This is not a
regression.  I don't see a single compelling reason to consider this
for 2.6.27.  I asked for clarification so I could be certain you were
solving the right problem.

Why didn't these patches show up 3 months ago when the last merge
window closed?  Why not even earlier?

I totally agree that what we are doing could be done better, however at
this point we should be looking at 2.6.28.  In which case looking at
the general long term non-hack solution is the right way to go.  Can we
scale to different workloads?

For everyone with fewer than 4K cpus the current behavior is fine, and
with 4k cpus it results in a modest slowdown.  This sounds usable.

You have hit an extremely sore spot with me.  Anytime someone makes an
argument that I hear as "RHEL is going to ship 2.6.27 so we _need_ this
patch in 2.6.27" I want to stop listening.  I just don't care.
Unfortunately I have heard that argument almost once a day for the last
week, and I am tired of it.

Why hasn't someone complained that waitpid is still slow?

Why haven't we seen patches to reduce the number of kernel threads
since the last time you had problems with the pid infrastructure?

A very frustrated code reviewer.

So yes.  If you are not interested in 2.6.28 and in the general problem,
I'm not interested in this problem.

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Stephen Champion @ 2008-08-06 3:21 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
>> But if we simply scale based upon num_possible_cpus(), we get a
>> relatively representative scaling function.  Usually, customers buy
>> machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
>> 512k, or even 1m to be acceptable at this size of machine.
>
> For your customers, and your kernel thread workload, you get a
> reasonable representation.  For other different people and different
> workloads you don't.  I happen to know of a completely different class
> of workload that can do better.

Although Robin probably has broader experience, I think we have both had
the opportunity to examine the workloads and configuration of a
reasonable sample of the active (and historical) large (>=512c) shared
memory systems.  Some workloads and configurations are specialized, and
perhaps less stressing than the mixed, volatile loads and array of
services most of these systems are expected to handle, but the
specialized loads have been the exceptions in my experience.  That may
change as the price/core continues to go down and pseudo-shared memory
systems based on cluster interconnects become more common and possibly
even viable, but don't hold your breath.

>> For 2.6.27, would you accept an upper cap based on the memory size
>> algorithm you have now and adjusted for num_possible_cpus()?
>> Essentially the first patch I posted.
>
> I want to throw a screaming hissy fit.

If those get more cycles to my users, I'll start reading the list
religiously!

> The merge window has closed.  This is not a bug.  This is not a
> regression.  I don't see a single compelling reason to consider this
> for 2.6.27.  I asked for clarification so I could be certain you were
> solving the right problem.

Early in 2.6.28 might work for us.  2.6.27 would be nice.

Yes, we'd like a distribution vendor(s) to pull it.  If we ask nicely,
the one which matters to me (and my users) is quite likely to take it if
it has been accepted early in the next cycle.  They've been very good
about that sort of thing (for which I'm very thankful).  So while it's
extra administrivia, I'm not the one who has to fill out the forms and
write up the justification ;-)

But the opposite question: does the proposed patch have significant risk
or drawbacks?  We know it offers a minor but noticeable performance
improvement for at least some of the small set of systems it affects.
Is it an unreasonable risk for other systems - or is there a known group
of systems it would have an effect on which would not benefit or might
even be harmed?  Would a revision of it be acceptable, and if so (based
on answers to the prior questions), what criteria should a revision
meet, and what time frame should we target?

> Why didn't these patches show up 3 months ago when the last merge
> window closed?  Why not even earlier?

It was not a high priority, and I didn't push on it until after the
trouble with proc_pid_readdir was resolved (and the fix floated
downstream to me).  Sorry, but it was lost in higher priority work, and
not something nagging at me, as I had already made the change on the
systems I build for.

> I totally agree that what we are doing could be done better, however at
> this point we should be looking at 2.6.28.  In which case looking at
> the general long term non-hack solution is the right way to go.  Can we
> scale to different workloads?
>
> For everyone with fewer than 4K cpus the current behavior is fine, and
> with 4k cpus it results in a modest slowdown.  This sounds usable.

I'd say the breakpoint - where increasing the size of the pid hash
starts having a useful return - is more like 512 or 1024.  On NUMA boxes
(which I think is most, if not all, of the large processor count
systems), running a list in the bucket (which more often than not will
be remote) can be expensive, so we'd like to be closer to 1 process per
bucket.

> You have hit an extremely sore spot with me.  Anytime someone makes an
> argument that I hear as "RHEL is going to ship 2.6.27 so we _need_ this
> patch in 2.6.27" I want to stop listening.  I just don't care.
> Unfortunately I have heard that argument almost once a day for the last
> week, and I am tired of it.

Only once a day?  Easy silly season, for having two major distributions
taking a snapshot on 2.6.27...  I can see that getting annoying, and
it's an unfortunate follow-on effect of how Linux gets delivered to
users who require commercial support and/or 3rd party application
certifications for whatever reason (which unfortunately includes my
users).  Developers and users both need to push the major distributions
to offer something reasonably current - we're both stuck with this
silliness until users can count on new development being delivered in
something a bit shorter than two years.  Caught in the middle, I ask
both sides to push on the distributions at every opportunity!
<push push>

> Why hasn't someone complained that waitpid is still slow?

Is it?  I hadn't noticed, but I usually only go for the things users are
in my cubicle complaining about, and I'm way downstream, so if it's not
a problem there, I won't notice until I can get some time on a system to
play with something current (within the next week or two, I hope).  I
can look then, if you'd like.

> Why haven't we seen patches to reduce the number of kernel threads
> since the last time you had problems with the pid infrastructure?
>
> A very frustrated code reviewer.
>
> So yes.  If you are not interested in 2.6.28 and in the general
> problem, I'm not interested in this problem.

Is there a general problem?  The last time we had trouble with the pid
infrastructure, I believe it was the result of a patch leaking through
which, frankly, was quite poor.  I believe its deficiencies have been
addressed, and it looks like we now have a respectable implementation
which should serve us well for a while.

There certainly is room for major architectural improvements.  Your
ideas for moving from a hash to a radix tree are a good direction to
take, and are something we should work on as processor counts continue
to grow.  It is likely that we stand to gain in both raw cycles consumed
as well as memory consumption - but we're not going to see that
tomorrow.

I would think reducing process counts is also a longer term project.  I
wouldn't be looking at 2.6.28 for that, but rather 2.6.30 or so.  Most
(possibly all) of the worst offenders appear to be using
create_workqueue, which I don't expect will be trivial to change.  If
someone picked up the task today, it might be ready for 2.6.29, but we
may want more soak time, as it looks to me like an intrusive change with
a high potential for unexpected consequences.

From where I'm sitting, the current mechanism seems to do reasonably
well, even with very large numbers of processes (hundreds of thousands),
provided that the hash table is large enough to account for increased
use.  The immediate barrier to adequate performance on large systems
(that is, not unnecessarily wasting a significant portion of cycles) is
the unreasonably low cap on the size of the hash table: it's an
artificial limit, based on an outdated set of expectations about the
sizes of systems.  As such, it's easy to extend the useful life of the
current implementation with very little cost or effort.

A major rework with more efficient resource usage may be a higher
priority for someone looking at higher processor counts with
(relatively) tiny memory sizes.  If such people exist, it should not be
difficult to take them into account when sizing the existing pid hash.
That's a short term (tomorrow-ish), very low risk project with immediate
benefit: a small patch with no effect on systems <512c, which grows the
pid hash when it is likely to be beneficial and there is plenty of
memory to spare.

I'd really like to see an increased limit to the size of the pid hash in
the near term.  If we can reduce process counts, we might revisit the
sizing.  Better would be to start work on a more resource efficient
implementation to eliminate it before we have to revisit it.  Ideal
would be to move ahead with all three.  I don't see any (sensible)
reason for any of these steps to be mutually exclusive.

--
Stephen Champion              Silicon Graphics Site Team
schamp@(sgi.com|nas.nasa.gov) NASA Advanced Supercomputing
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
  2008-08-01 12:04           ` Robin Holt
  2008-08-01 18:27             ` Eric W. Biederman
@ 2008-08-01 18:49             ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2008-08-01 18:49 UTC (permalink / raw)
  To: Robin Holt
  Cc: Eric W. Biederman, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Andrew Morton

On Fri, 1 Aug 2008, Robin Holt wrote:
>
> It was initially detected while profiling 'ps' on a 2048p machine that
> had 13 kernel threads per cpu.

Grr. Wouldn't it be better to concentrate on that "13 kernel threads per
cpu" part?

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-08-06  3:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2008-07-31 17:00 [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus() Robin Holt
2008-07-31 18:35 ` Eric W. Biederman
2008-07-31 19:32   ` Robin Holt
2008-07-31 19:49     ` Eric W. Biederman
2008-07-31 20:08       ` Robin Holt
2008-07-31 22:04         ` Eric W. Biederman
2008-08-01 12:04           ` Robin Holt
2008-08-01 18:27             ` Eric W. Biederman
2008-08-01 19:13               ` Robin Holt
2008-08-01 19:59                 ` Eric W. Biederman
2008-08-04 13:11                   ` Stephen Champion
2008-08-04 20:36                     ` Eric W. Biederman
2008-08-04 23:58                       ` Robin Holt
2008-08-05  0:38                         ` Eric W. Biederman
2008-08-06  3:21                           ` Stephen Champion
2008-08-01 18:49             ` Linus Torvalds