All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Stephen Champion <schamp@sgi.com>
Cc: Robin Holt <holt@sgi.com>,
	linux-kernel@vger.kernel.org, Pavel Emelyanov <xemul@openvz.org>,
	Oleg Nesterov <oleg@tv-sign.ru>,
	Sukadev Bhattiprolu <sukadev@us.ibm.com>,
	Paul Menage <menage@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().
Date: Mon, 04 Aug 2008 13:36:38 -0700	[thread overview]
Message-ID: <m17iawy0tl.fsf@frodo.ebiederm.org> (raw)
In-Reply-To: <4896FFFE.7080400@sgi.com> (Stephen Champion's message of "Mon, 04 Aug 2008 06:11:26 -0700")

Stephen Champion <schamp@sgi.com> writes:

> Despite more recent changes in proc_pid_readdir, my results should apply to
> current source.  It looks like both the old 2.6.16 implementation and the
> current version will call find_pid (or equivalent) once for each successive
> getdents() call on /proc, excepting when the cursor is on the first entry.  A
> quick look, and we have 88 getdents64() calls both 'ps' and 'ls /proc' with 29k
> processes running, which appears to be the primary source of calls.

Yes.  proc_pid_readdir->next_tgid->find_ge_pid does a hash table lookup.

> It's not giganormous, although I probably could come up with a pointless
> microbenchmark to show it's 300% better.  Importantly, it does noticeably
> improve normal interactive tools like 'ps' and 'top', a performance
> visualization tool developed by a customer (nodemon) refreshes faster.  For a
> 512k init allocation, that seems like a very good deal.

Alright.  The fact that it really is tools like ps and top that are proc intensive
shapes my opinion somewhat.  At it is doing lots and lots of hash tables lookups
so we can see each process that is expensive.  Rather then simple one at
a time lookups.

Hash table sizing gets us into a guessing game of how many pids do we expect
to deal with on this work load.  And we guess by either memory (how many hash
table entries can we afford) or cpu how many hash table entries can we expect.

There is an alternative to growing the hash table that I was playing
with at one time.  Use a radix tree that has essentially the same
properties for insert and delete as today's hash table. rcu for lookup
and a lock (spin lock?) for insert (memory allocation for the intermediate
nodes is the tricky bit).

Because pids are naturally dense we can accelerate the /proc accesses
by walking the radix tree and with a good implementation we only spend
as much memory as we have pids to worry about.

This is particular appealing to me in the context of pid namespaces.

I did not pursue it at the time I thought of it because I could not think
of a real workload that had enough pids to make the performance of the
current pid hash be noticable.  For our normal workload today my rough numbers
said the pid hash is optimal.

The interesting twist to this tale is that you have 29k processes, and
20k of them are kernel threads.  Which seems to say that it isn't user
space that is putting a heavy workloads on the pid infrastructure but
all of the per cpu kernel threads.

At only 9k processes we will have an average hash chain length of 2.25
which should perform well, as we have a good hash function and hash chain
lengths cluster well.  This is as opposed to the average hash chain length
of 7.25 which you are seeing now.

It looks to me like a cap of 64k is excessive we could place it down
at 16k (pid_hash_shift = 14) and your machines should perform well in
ps and top related activities, and it would only cost 128k.  If
you killed the excess kernel processes you could live very happily
with a 8k (pid_hash_shift = 13).

If we want something that tunes to the work load I expect a radix tree
would be best.  If the goal after 4k cpus is 64k cpus which I heard someone
mention I think that is what you really want.  A design that scales to
the workload on the system.

Eric

  reply	other threads:[~2008-08-04 20:43 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-31 17:00 [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus() Robin Holt
2008-07-31 18:35 ` Eric W. Biederman
2008-07-31 19:32   ` Robin Holt
2008-07-31 19:49     ` Eric W. Biederman
2008-07-31 20:08       ` Robin Holt
2008-07-31 22:04         ` Eric W. Biederman
2008-08-01 12:04           ` Robin Holt
2008-08-01 18:27             ` Eric W. Biederman
2008-08-01 19:13               ` Robin Holt
2008-08-01 19:59                 ` Eric W. Biederman
2008-08-04 13:11                   ` Stephen Champion
2008-08-04 20:36                     ` Eric W. Biederman [this message]
2008-08-04 23:58                       ` Robin Holt
2008-08-05  0:38                         ` Eric W. Biederman
2008-08-06  3:21                           ` Stephen Champion
2008-08-01 18:49             ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m17iawy0tl.fsf@frodo.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=akpm@linux-foundation.org \
    --cc=holt@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=menage@google.com \
    --cc=oleg@tv-sign.ru \
    --cc=schamp@sgi.com \
    --cc=sukadev@us.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.