* [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 17:00 UTC
To: linux-kernel
Cc: Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage,
Eric W. Biederman, Linus Torvalds, Andrew Morton
For large cpu configurations, we find that the number of pids in a pidhash
bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.

This patch makes the upper limit scale based upon num_possible_cpus().
For machines with 128 cpus or fewer, the current upper limit of 12 is
maintained.
Signed-off-by: Robin Holt <holt@sgi.com>
Index: contention_unroll/kernel/pid.c
===================================================================
--- contention_unroll.orig/kernel/pid.c 2008-07-31 11:59:21.154284073 -0500
+++ contention_unroll/kernel/pid.c 2008-07-31 11:59:22.862497720 -0500
@@ -502,9 +502,10 @@ void __init pidhash_init(void)
 {
 	int i, pidhash_size;
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
+	int pidhash_shift_ul = max(12, fls(num_possible_cpus() - 1) + 5);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
-	pidhash_shift = min(12, pidhash_shift);
+	pidhash_shift = min(pidhash_shift_ul, pidhash_shift);
 	pidhash_size = 1 << pidhash_shift;
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
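[The proposed limit can be sketched outside the kernel as follows -- an illustrative Python stand-in for the C logic above, not kernel code, with fls() emulated via int.bit_length():]

```python
def fls(x):
    # Stand-in for the kernel's fls(): 1-based index of the most
    # significant set bit, with fls(0) == 0.
    return x.bit_length()

def pidhash_shift_upper_limit(num_possible_cpus):
    # Mirrors the patch: max(12, fls(num_possible_cpus() - 1) + 5)
    return max(12, fls(num_possible_cpus - 1) + 5)

# 128 cpus or fewer keep the old cap of 12; a 2048 cpu machine gets 16.
for cpus in (1, 128, 256, 2048):
    print(cpus, pidhash_shift_upper_limit(cpus))
```

[This reproduces the numbers in the changelog: the cap stays at 12 up to 128 cpus and reaches 16 at 2048 cpus.]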
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 18:35 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> For large cpu configurations, we find that the number of pids in a pidhash
> bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
> from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
>
> This patch makes the upper limit scale based upon num_possible_cpus().
> For machines with 128 cpus or fewer, the current upper limit of 12 is
> maintained.

It looks like there is a magic limit we are dancing around.

Can we please make the maximum for the hash table size be based on the
maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?

12 actually looks like it was fls(PID_MAX_LIMIT) - 3 at one point in
time.

Basing the hash table size on PID_MAX_LIMIT seems to say something
interesting about the maximum number of hash chain entries we will
tolerate in practice.
Eric

> Signed-off-by: Robin Holt <holt@sgi.com>
>
>
> Index: contention_unroll/kernel/pid.c
> ===================================================================
> --- contention_unroll.orig/kernel/pid.c	2008-07-31 11:59:21.154284073 -0500
> +++ contention_unroll/kernel/pid.c	2008-07-31 11:59:22.862497720 -0500
> @@ -502,9 +502,10 @@ void __init pidhash_init(void)
>  {
>  	int i, pidhash_size;
>  	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
> +	int pidhash_shift_ul = max(12, fls(num_possible_cpus() - 1) + 5);
>
>  	pidhash_shift = max(4, fls(megabytes * 4));
> -	pidhash_shift = min(12, pidhash_shift);
> +	pidhash_shift = min(pidhash_shift_ul, pidhash_shift);
>  	pidhash_size = 1 << pidhash_shift;
>
>  	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 19:32 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Thu, Jul 31, 2008 at 11:35:19AM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > For large cpu configurations, we find that the number of pids in a pidhash
> > bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
> > from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
> >
> > This patch makes the upper limit scale based upon num_possible_cpus().
> > For machines with 128 cpus or fewer, the current upper limit of 12 is
> > maintained.
>
> It looks like there is a magic limit we are dancing around.
>
> Can we please make the maximum for the hash table size be based on the
> maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?

I am happy to base it upon whatever you think is correct.  So long as it
goes up for machines with lots of cpus, that will satisfy me.  It is
probably as much a problem on smaller machines, but if you have _THAT_
many pids in use, you are probably oversubscribing many other resources
and don't really care.  That limit will essentially become a constant
(the compiler may even do that for us, but I have not checked any arch
other than ia64).  Should I just replace the 12 with a 16 or 17 or some
new magic number?

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 19:49 UTC
To: Robin Holt
Cc: Eric W. Biederman, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> On Thu, Jul 31, 2008 at 11:35:19AM -0700, Eric W. Biederman wrote:
>> Robin Holt <holt@sgi.com> writes:
>>
>> > For large cpu configurations, we find that the number of pids in a pidhash
>> > bucket causes things like 'ps' to perform slowly.  Raising pidhash_shift
>> > from 12 to 16 cut the time for 'ps' in half on a 2048 cpu machine.
>> >
>> > This patch makes the upper limit scale based upon num_possible_cpus().
>> > For machines with 128 cpus or fewer, the current upper limit of 12 is
>> > maintained.
>>
>> It looks like there is a magic limit we are dancing around.
>>
>> Can we please make the maximum for the hash table size be based on the
>> maximum number of pids?  That is, fls(PID_MAX_LIMIT) - 6?
>
> I am happy to base it upon whatever you think is correct.  So long as it
> goes up for machines with lots of cpus, that will satisfy me.  It is
> probably as much a problem on smaller machines, but if you have _THAT_
> many pids in use, you are probably oversubscribing many other resources
> and don't really care.  That limit will essentially become a constant
> (the compiler may even do that for us, but I have not checked any arch
> other than ia64).  Should I just replace the 12 with a 16 or 17 or some
> new magic number?

I like setting the limit as a maximum hash chain length, which is what
fls(PID_MAX_LIMIT) - X is.  X is the maximum hash chain length you can
tolerate.

Eric
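[The chain-length reading of the constant can be sketched as follows -- a Python illustration, where PID_MAX_LIMIT is taken as the common 32768 default, an assumption about the configuration rather than a value from this thread:]

```python
def fls(x):
    # 1-based index of the most significant set bit; fls(0) == 0.
    return x.bit_length()

def full_table_chain_length(pid_max_limit, x):
    # Average pids per bucket when the table holds PID_MAX_LIMIT pids
    # and pidhash_shift = fls(PID_MAX_LIMIT) - X.
    shift = fls(pid_max_limit) - x
    return pid_max_limit / (1 << shift)

# With PID_MAX_LIMIT a power of two, the result is 2**(X - 1), so the
# constant X really is a statement about tolerated chain length:
PID_MAX_LIMIT = 32768  # common default; assumed here for illustration
print(full_table_chain_length(PID_MAX_LIMIT, 6))  # chains of 32 at the limit
print(full_table_chain_length(PID_MAX_LIMIT, 3))  # chains of 4 at the limit
```

[Note the same 2**(X - 1) bound holds for any power-of-two PID_MAX_LIMIT, including the larger values possible on 64-bit configurations.]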
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-07-31 20:08 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Like so???

I have not tested this yet.

---

Set the upper limit for the pidhash_shift based upon PID_MAX_LIMIT.

Signed-off-by: Robin Holt <holt@sgi.com>

Index: contention_unroll/kernel/pid.c
===================================================================
--- contention_unroll.orig/kernel/pid.c	2008-07-31 11:59:21.154284073 -0500
+++ contention_unroll/kernel/pid.c	2008-07-31 15:07:21.909158788 -0500
@@ -504,7 +504,7 @@ void __init pidhash_init(void)
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
-	pidhash_shift = min(12, pidhash_shift);
+	pidhash_shift = min(fls(PID_MAX_LIMIT) - 6, pidhash_shift);
 	pidhash_size = 1 << pidhash_shift;
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-07-31 22:04 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> Like so???
>
> I have not tested this yet.

Looks reasonable to me.

In what circumstances was the lookup in the pid hash table with long
chains causing a performance slowdown?  We don't perform a lot of
lookups.

> -	pidhash_shift = min(12, pidhash_shift);
> +	pidhash_shift = min(fls(PID_MAX_LIMIT) - 6, pidhash_shift);
>  	pidhash_size = 1 << pidhash_shift;
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-01 12:04 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > Like so???
> >
> > I have not tested this yet.
>
> Looks reasonable to me.
>
> In what circumstances was the lookup in the pid hash table with long
> chains causing a performance slowdown?  We don't perform a lot of
> lookups.

It was initially detected while profiling 'ps' on a 2048p machine that
had 13 kernel threads per cpu.  We added a couple more device drivers
which added additional threads.  We then started a pthread-on-process
MPI job which had 2048 ranks, each with 4 threads (a test case from a
customer job).  There were misc other processes out there which brought
our task count up to approx 63k.  A larger page size helped the problem
(we went from 16k to 64k).

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-01 18:27 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
>> Robin Holt <holt@sgi.com> writes:
>>
>> > Like so???
>> >
>> > I have not tested this yet.
>>
>> Looks reasonable to me.
>>
>> In what circumstances was the lookup in the pid hash table with long
>> chains causing a performance slowdown?  We don't perform a lot of
>> lookups.
>
> It was initially detected while profiling 'ps' on a 2048p machine that
> had 13 kernel threads per cpu.  We added a couple more device drivers
> which added additional threads.  We then started a pthread-on-process
> MPI job which had 2048 ranks, each with 4 threads (a test case from a
> customer job).  There were misc other processes out there which brought
> our task count up to approx 63k.  A larger page size helped the problem
> (we went from 16k to 64k).

Large page size?  Do you mean larger hash size?

What were you measuring that showed improvement with the large hash
size?

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-01 19:13 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

On Fri, Aug 01, 2008 at 11:27:20AM -0700, Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
> > On Thu, Jul 31, 2008 at 03:04:56PM -0700, Eric W. Biederman wrote:
> >> Robin Holt <holt@sgi.com> writes:
> >>
> >> > Like so???
> >> >
> >> > I have not tested this yet.
> >>
> >> Looks reasonable to me.
> >>
> >> In what circumstances was the lookup in the pid hash table with long
> >> chains causing a performance slowdown?  We don't perform a lot of
> >> lookups.
> >
> > It was initially detected while profiling 'ps' on a 2048p machine that
> > had 13 kernel threads per cpu.  We added a couple more device drivers
> > which added additional threads.  We then started a pthread-on-process
> > MPI job which had 2048 ranks, each with 4 threads (a test case from a
> > customer job).  There were misc other processes out there which brought
> > our task count up to approx 63k.  A larger page size helped the problem
> > (we went from 16k to 64k).
>
> Large page size?  Do you mean larger hash size?
>
> What were you measuring that showed improvement with the large hash
> size?

Oops, confusing details.  That was a different problem we had been
tracking.

Sorry for the confusion,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-01 19:59 UTC
To: Robin Holt
Cc: linux-kernel, Pavel Emelyanov, Oleg Nesterov, Sukadev Bhattiprolu,
	Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> Oops, confusing details.  That was a different problem we had been
> tracking.

Which leads back to the original question.  What were you measuring
that showed improvement with a larger pid hash size?

Almost by definition a larger hash table will perform better.  However
my intuition is that we are talking about something that should be in
the noise for most workloads.

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Stephen Champion @ 2008-08-04 13:11 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>> Oops, confusing details.  That was a different problem we had been
>> tracking.
>
> Which leads back to the original question.  What were you measuring
> that showed improvement with a larger pid hash size?
>
> Almost by definition a larger hash table will perform better.  However
> my intuition is that we are talking about something that should be in
> the noise for most workloads.

Robin asked me to chime in on this, as I did the early "look at that"
work and suggested it to Robin.

I noticed the potential for increasing pidhash_shift while chasing down
a patch to our kernel (2.6.16 stable based) which had proc_pid_readdir()
calling find_pid() for init_task through the highest pid #.  This patch
caused a rather serious problem on a 2048 core Altix.  Before
identifying the culprit, I increased pidhash_shift.  This made a *huge*
difference: enough to get the box marginally functional while I tracked
down the origins of the problem.

After backing out the problematic patch, I took a look at pidhash_shift
in normal circumstances:

With pidhash_shift == 12, running only a few common services and
monitoring tools (sendmail, nagios, etc.) for ~28k active processes,
mostly of the kernel variety, the 20 cpu boot cpuset we use on that
system to confine normal system processes and interactive logins was
spending >1% of its time in find_pid(), and an 'ls /proc > /dev/null'
took >0.4s.  With pidhash_shift == 16, the timing went to <0.2s, and the
total time spent in find_pid() was reduced to noise level.

In addition to raising the limit on larger systems, it looked reasonable
to scale the pid hash with the # of processors instead of memory.  While
I observed variably high process:cpu ratios on small systems (2c - 32c),
they also have relatively few processes.  The 192c - 2048c systems I was
able to look at were all hovering at 13 +/- 2 processes per cpu, even
with wildly varying memory sizes.

Despite more recent changes in proc_pid_readdir, my results should apply
to current source.  It looks like both the old 2.6.16 implementation and
the current version will call find_pid (or equivalent) once for each
successive getdents() call on /proc, excepting when the cursor is on the
first entry.  A quick look, and we have 88 getdents64() calls for both
'ps' and 'ls /proc' with 29k processes running, which appears to be the
primary source of calls.

It's not giganormous, although I probably could come up with a pointless
microbenchmark to show it's 300% better.  Importantly, it does
noticeably improve normal interactive tools like 'ps' and 'top', and a
performance visualization tool developed by a customer (nodemon)
refreshes faster.  For a 512k init allocation, that seems like a very
good deal.

I'd like to lose 20,000 kernel processes in addition to growing the pid
hash!
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-04 20:36 UTC
To: Stephen Champion
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Stephen Champion <schamp@sgi.com> writes:

> Despite more recent changes in proc_pid_readdir, my results should apply
> to current source.  It looks like both the old 2.6.16 implementation and
> the current version will call find_pid (or equivalent) once for each
> successive getdents() call on /proc, excepting when the cursor is on the
> first entry.  A quick look, and we have 88 getdents64() calls for both
> 'ps' and 'ls /proc' with 29k processes running, which appears to be the
> primary source of calls.

Yes.  proc_pid_readdir->next_tgid->find_ge_pid does a hash table lookup.

> It's not giganormous, although I probably could come up with a pointless
> microbenchmark to show it's 300% better.  Importantly, it does
> noticeably improve normal interactive tools like 'ps' and 'top', and a
> performance visualization tool developed by a customer (nodemon)
> refreshes faster.  For a 512k init allocation, that seems like a very
> good deal.

Alright.  The fact that it really is tools like ps and top that are proc
intensive shapes my opinion somewhat.  It is doing lots and lots of hash
table lookups so we can see each process that is expensive, rather than
simple one-at-a-time lookups.

Hash table sizing gets us into a guessing game of how many pids we
expect to deal with on this workload.  And we guess by either memory
(how many hash table entries can we afford) or cpu (how many hash table
entries can we expect).

There is an alternative to growing the hash table that I was playing
with at one time.  Use a radix tree that has essentially the same
properties for insert and delete as today's hash table: rcu for lookup
and a lock (spin lock?) for insert (memory allocation for the
intermediate nodes is the tricky bit).  Because pids are naturally dense
we can accelerate the /proc accesses by walking the radix tree, and with
a good implementation we only spend as much memory as we have pids to
worry about.  This is particularly appealing to me in the context of pid
namespaces.

I did not pursue it at the time I thought of it because I could not
think of a real workload that had enough pids to make the performance of
the current pid hash be noticeable.  For our normal workload today my
rough numbers said the pid hash is optimal.

The interesting twist to this tale is that you have 29k processes, and
20k of them are kernel threads.  Which seems to say that it isn't user
space that is putting a heavy workload on the pid infrastructure but all
of the per cpu kernel threads.

At only 9k processes we will have an average hash chain length of 2.25,
which should perform well, as we have a good hash function and hash
chain lengths cluster well.  This is as opposed to the average hash
chain length of 7.25 which you are seeing now.

It looks to me like a cap of 64k entries is excessive; we could place it
down at 16k (pid_hash_shift = 14) and your machines should perform well
in ps and top related activities, and it would only cost 128k.  If you
killed the excess kernel processes you could live very happily with an
8k table (pid_hash_shift = 13).

If we want something that tunes to the workload I expect a radix tree
would be best.  If the goal after 4k cpus is 64k cpus, which I heard
someone mention, I think that is what you really want: a design that
scales to the workload on the system.

Eric
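[The chain-length figures quoted above work out as follows -- a quick Python sketch, where "9k" and "29k" are taken as 9216 and 29696, the counts that yield exactly 2.25 and 7.25 in a 4096-entry (shift 12) table:]

```python
def avg_chain_length(num_pids, pidhash_shift):
    # Expected pids per bucket, assuming the hash spreads entries evenly.
    return num_pids / (1 << pidhash_shift)

print(avg_chain_length(9216, 12))    # 2.25 with today's 4096-entry table
print(avg_chain_length(29696, 12))   # 7.25 as observed with ~29k tasks
print(avg_chain_length(29696, 14))   # ~1.81 with the suggested 16k entries
```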
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Robin Holt @ 2008-08-04 23:58 UTC
To: Eric W. Biederman
Cc: Stephen Champion, Robin Holt, linux-kernel, Pavel Emelyanov,
	Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage, Linus Torvalds,
	Andrew Morton

On Mon, Aug 04, 2008 at 01:36:38PM -0700, Eric W. Biederman wrote:
> Stephen Champion <schamp@sgi.com> writes:
>
> If we want something that tunes to the workload I expect a radix tree
> would be best.  If the goal after 4k cpus is 64k cpus, which I heard
> someone mention, I think that is what you really want: a design that
> scales to the workload on the system.

But if we simply scale based upon num_possible_cpus(), we get a
relatively representative scaling function.  Usually, customers buy
machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
512k, or even 1m to be acceptable at this size of machine.

For 2.6.27, would you accept an upper cap based on the memory size
algorithm you have now and adjusted for num_possible_cpus()?
Essentially the first patch I posted.

I would like to try and not be responsible for the radix tree
implementation as I have other more pressing obligations.  If, however,
it was a condition of getting an interim solution into 2.6.27, I would
be willing to discuss this with my management.

Thanks,
Robin
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Eric W. Biederman @ 2008-08-05 0:38 UTC
To: Robin Holt
Cc: Stephen Champion, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Robin Holt <holt@sgi.com> writes:

> But if we simply scale based upon num_possible_cpus(), we get a
> relatively representative scaling function.  Usually, customers buy
> machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
> 512k, or even 1m to be acceptable at this size of machine.

For your customers, and your kernel thread workload, you get a
reasonable representation.  For other different people and different
workloads you don't.  I happen to know of a completely different class
of workload that can do better.

> For 2.6.27, would you accept an upper cap based on the memory size
> algorithm you have now and adjusted for num_possible_cpus()?
> Essentially the first patch I posted.

I want to throw a screaming hissy fit.

The merge window has closed.  This is not a bug.  This is not a
regression.  I don't see a single compelling reason to consider this
for 2.6.27.  I asked for clarification so I could be certain you were
solving the right problem.

Why didn't these patches show up 3 months ago when the last merge
window closed?  Why not even earlier?

I totally agree that what we are doing could be done better, however at
this point we should be looking at 2.6.28.  In which case looking at
the general long term non-hack solution is the right way to go.  Can we
scale to different workloads?

For everyone with fewer than 4K cpus the current behavior is fine, and
with 4k cpus it results in a modest slowdown.  This sounds usable.

You have hit an extremely sore spot with me.  Anytime someone makes an
argument that I hear as "RHEL is going to ship 2.6.27 so we _need_ this
patch in 2.6.27" I want to stop listening.  I just don't care.
Unfortunately I have heard that argument almost once a day for the last
week, and I am tired of it.

Why hasn't someone complained that waitpid is still slow?

Why haven't we seen patches to reduce the number of kernel threads
since the last time you had problems with the pid infrastructure?

A very frustrated code reviewer.

So yes.  If you are not interested in 2.6.28 and in the general problem,
I'm not interested in this problem.

Eric
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
From: Stephen Champion @ 2008-08-06 3:21 UTC
To: Eric W. Biederman
Cc: Robin Holt, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Linus Torvalds, Andrew Morton

Eric W. Biederman wrote:
> Robin Holt <holt@sgi.com> writes:
>
>> But if we simply scale based upon num_possible_cpus(), we get a
>> relatively representative scaling function.  Usually, customers buy
>> machines with 1, 2, or 4GB per cpu.  I would expect a waste of 256k,
>> 512k, or even 1m to be acceptable at this size of machine.
>
> For your customers, and your kernel thread workload, you get a
> reasonable representation.  For other different people and different
> workloads you don't.  I happen to know of a completely different class
> of workload that can do better.

Although Robin probably has broader experience, I think we have both had
the opportunity to examine the workloads and configuration of a
reasonable sample of the active (and historical) large (>=512c) shared
memory systems.  Some workloads and configurations are specialized, and
perhaps less stressing than the mixed, volatile loads and array of
services most of these systems are expected to handle, but the
specialized loads have been the exceptions in my experience.  That may
change as the price/core continues to go down and pseudo-shared memory
systems based on cluster interconnects become more common and possibly
even viable, but don't hold your breath.

>> For 2.6.27, would you accept an upper cap based on the memory size
>> algorithm you have now and adjusted for num_possible_cpus()?
>> Essentially the first patch I posted.
>
> I want to throw a screaming hissy fit.

If those get more cycles to my users, I'll start reading the list
religiously!

> The merge window has closed.  This is not a bug.  This is not a
> regression.  I don't see a single compelling reason to consider this
> for 2.6.27.  I asked for clarification so I could be certain you were
> solving the right problem.

Early in 2.6.28 might work for us.  2.6.27 would be nice.

Yes, we'd like a distribution vendor(s) to pull it.  If we ask nicely,
the one which matters to me (and my users) is quite likely to take it if
it has been accepted early in the next cycle.  They've been very good
about that sort of thing (for which I'm very thankful).  So while it's
extra administrivia, I'm not the one who has to fill out the forms and
write up the justification ;-)

But the opposite question: does the proposed patch have significant risk
or drawbacks?  We know it offers a minor but noticeable performance
improvement for at least some of the small set of systems it affects.
Is it an unreasonable risk for other systems - or is there a known group
of systems it would have an effect on which would not benefit or might
even be harmed?  Would a revision of it be acceptable, and if so (based
on answers to the prior questions), what criteria should a revision
meet, and what time frame should we target?

> Why didn't these patches show up 3 months ago when the last merge
> window closed?  Why not even earlier?

It was not a high priority, and I didn't push on it until after the
trouble with proc_pid_readdir was resolved (and the fix floated
downstream to me).  Sorry, but it was lost in higher priority work, and
not something nagging at me, as I had already made the change on the
systems I build for.

> I totally agree that what we are doing could be done better, however at
> this point we should be looking at 2.6.28.  In which case looking at
> the general long term non-hack solution is the right way to go.  Can we
> scale to different workloads?
>
> For everyone with fewer than 4K cpus the current behavior is fine, and
> with 4k cpus it results in a modest slowdown.  This sounds usable.

I'd say the breakpoint - where increasing the size of the pid hash
starts having a useful return - is more like 512 or 1024.  On NUMA boxes
(which I think is most, if not all, of the large processor count
systems), running a list in the bucket (which more often than not will
be remote) can be expensive, so we'd like to be closer to 1 process per
bucket.

> You have hit an extremely sore spot with me.  Anytime someone makes an
> argument that I hear as "RHEL is going to ship 2.6.27 so we _need_ this
> patch in 2.6.27" I want to stop listening.  I just don't care.
> Unfortunately I have heard that argument almost once a day for the last
> week, and I am tired of it.

Only once a day?  Easy silly season, for having two major distributions
taking a snapshot on 2.6.27...  I can see that getting annoying, and
it's an unfortunate follow-on effect of how Linux gets delivered to
users who require commercial support and/or 3rd party application
certifications for whatever reason (which unfortunately includes my
users).  Developers and users both need to push the major distributions
to offer something reasonably current - we're both stuck with this
silliness until users can count on new development being delivered in
something a bit shorter than two years.  Caught in the middle, I ask
both sides to push on the distributions at every opportunity!
<push push>

> Why hasn't someone complained that waitpid is still slow?

Is it?  I hadn't noticed, but I usually only go for the things users are
in my cubicle complaining about, and I'm way downstream, so if it's not
a problem there, I won't notice until I can get some time on a system to
play with something current (within the next week or two, I hope).  I
can look then, if you'd like.

> Why haven't we seen patches to reduce the number of kernel threads
> since the last time you had problems with the pid infrastructure?
>
> A very frustrated code reviewer.
>
> So yes.  If you are not interested in 2.6.28 and in the general
> problem, I'm not interested in this problem.

Is there a general problem?  The last time we had trouble with the pid
infrastructure, I believe it was the result of a patch leaking through
which, frankly, was quite poor.  I believe its deficiencies have been
addressed, and it looks like we now have a respectable implementation
which should serve us well for a while.

There certainly is room for major architectural improvements.  Your
ideas for moving from a hash to a radix tree are a good direction to
take, and are something we should work on as processor counts continue
to grow.  It is likely that we stand to gain in both raw cycles consumed
as well as memory consumption - but we're not going to see that
tomorrow.

I would think reducing process counts is also a longer term project.  I
wouldn't be looking at 2.6.28 for that, but rather 2.6.30 or so.  Most
(possibly all) of the worst offenders appear to be using
create_workqueue, which I don't expect will be trivial to change.  If
someone picked up the task today, it might be ready for 2.6.29, but we
may want more soak time, as it looks to me like an intrusive change with
a high potential for unexpected consequences.

From where I'm sitting, the current mechanism seems to do reasonably
well, even with very large numbers of processes (hundreds of thousands),
provided that the hash table is large enough to account for increased
use.  The immediate barrier to adequate performance on large systems
(that is, not unnecessarily wasting a significant portion of cycles) is
the unreasonably low cap on the size of the hash table: it's an
artificial limit, based on an outdated set of expectations about the
sizes of systems.  As such, it's easy to extend the useful life of the
current implementation with very little cost or effort.

A major rework with more efficient resource usage may be a higher
priority for someone looking at higher processor counts with
(relatively) tiny memory sizes.  If such people exist, it should not be
difficult to take them into account when sizing the existing pid hash.
That's a short term (tomorrow-ish), very low risk project with immediate
benefit: a small patch with no effect on systems <512c, which grows the
pid hash when it is likely to be beneficial and there is plenty of
memory to spare.

I'd really like to see an increased limit to the size of the pid hash in
the near term.  If we can reduce process counts, we might revisit the
sizing.  Better would be to start work on a more resource efficient
implementation to eliminate it before we have to revisit it.  Ideal
would be to move ahead with all three.  I don't see any (sensible)
reason for any of these steps to be mutually exclusive.

--
Stephen Champion              Silicon Graphics Site Team
schamp@(sgi.com|nas.nasa.gov) NASA Advanced Supercomputing
* Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
  2008-08-01 12:04           ` Robin Holt
  2008-08-01 18:27             ` Eric W. Biederman
@ 2008-08-01 18:49             ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2008-08-01 18:49 UTC (permalink / raw)
  To: Robin Holt
  Cc: Eric W. Biederman, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Andrew Morton

On Fri, 1 Aug 2008, Robin Holt wrote:
>
> It was initially detected while profiling 'ps' on a 2048p machine that
> had 13 kernel threads per cpu.

Grr. Wouldn't it be better to concentrate on that "13 kernel threads per
cpu" part?

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-08-06  3:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2008-07-31 17:00 [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus() Robin Holt
2008-07-31 18:35 ` Eric W. Biederman
2008-07-31 19:32   ` Robin Holt
2008-07-31 19:49     ` Eric W. Biederman
2008-07-31 20:08       ` Robin Holt
2008-07-31 22:04         ` Eric W. Biederman
2008-08-01 12:04           ` Robin Holt
2008-08-01 18:27             ` Eric W. Biederman
2008-08-01 19:13               ` Robin Holt
2008-08-01 19:59                 ` Eric W. Biederman
2008-08-04 13:11                   ` Stephen Champion
2008-08-04 20:36                     ` Eric W. Biederman
2008-08-04 23:58                       ` Robin Holt
2008-08-05  0:38                         ` Eric W. Biederman
2008-08-06  3:21                           ` Stephen Champion
2008-08-01 18:49             ` Linus Torvalds