* [RFC] kernel/pid.c pid allocation wierdness @ 2007-03-14 7:30 Pavel Emelianov 2007-03-14 14:12 ` Eric W. Biederman 2007-03-14 14:43 ` William Lee Irwin III 0 siblings, 2 replies; 14+ messages in thread From: Pavel Emelianov @ 2007-03-14 7:30 UTC (permalink / raw) To: Eric W. Biederman, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List Hi. I'm looking at how alloc_pid() works and can't understand one (simple/stupid) thing. It first kmem_cache_alloc()-s a struct pid, then calls alloc_pidmap() and at the end it takes the global pidmap_lock to add the new pid to the hash. The question is - why does alloc_pidmap() use at least two atomic ops and potentially loop to find a zero bit in the pidmap? Why not call alloc_pidmap() under pidmap_lock and find a zero bit in the pidmap w/o any loops and atomics? The same goes for free_pid(). Am I missing something? Thanks, Pavel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 7:30 [RFC] kernel/pid.c pid allocation wierdness Pavel Emelianov @ 2007-03-14 14:12 ` Eric W. Biederman 2007-03-14 15:03 ` William Lee Irwin III 2007-03-14 15:33 ` Oleg Nesterov 2007-03-14 14:43 ` William Lee Irwin III 1 sibling, 2 replies; 14+ messages in thread From: Eric W. Biederman @ 2007-03-14 14:12 UTC (permalink / raw) To: Pavel Emelianov Cc: Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers Pavel Emelianov <xemul@sw.ru> writes: > Hi. > > I'm looking at how alloc_pid() works and can't understand > one (simple/stupid) thing. > > It first kmem_cache_alloc()-s a struct pid, then calls > alloc_pidmap() and at the end it takes the global pidmap_lock > to add the new pid to the hash. > > The question is - why does alloc_pidmap() use at least > two atomic ops and potentially loop to find a zero bit > in the pidmap? Why not call alloc_pidmap() under pidmap_lock > and find a zero bit in the pidmap w/o any loops and atomics? > > The same goes for free_pid(). Am I missing something? Well, as far as I can tell that is just the way the code evolved. Looking at the history, at the time I started messing with it, alloc_pidmap was the function and it behaved pretty much as it does today with locking (except it didn't disable irqs). To add the allocation of struct pid, I added alloc_pid as a wrapper, left alloc_pidmap alone, and added the hash table manipulation code. I know this results in fairly short hold times, which is moderately important for a global lock. We loop in alloc_pidmap because of what we are trying to do. Simply returning the first free pid number would have bad effects on user space, so we have the simple requirement that we don't reuse pid numbers for as long as is practical. We achieve that by doing full walks through the pid space before we consider a pid again. So we have to start from the last place we looked. 
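The cyclic-scan policy Eric describes can be sketched in userspace C. This is a toy model, not the kernel code: the flat byte array, `PID_MAX` of 4096, and the single-threaded test-and-set stand in for the kernel's paged bitmap and its atomic bit operations, and `RESERVED_PIDS` mimics the kernel's habit of wrapping the scan back above the low reserved range rather than to 0.

```c
#include <assert.h>

#define PID_MAX       4096  /* toy value; the kernel default is 32768 */
#define RESERVED_PIDS 300   /* wrap back here, not to 0 */

static unsigned char pidmap[PID_MAX / 8];
static int last_pid;

static int test_and_set_pid(int pid)
{
    unsigned char mask = 1u << (pid % 8);

    if (pidmap[pid / 8] & mask)
        return 1;               /* already in use */
    pidmap[pid / 8] |= mask;
    return 0;
}

/*
 * Cyclic allocation: resume the scan just past the last pid handed
 * out, so a freed number is not reused until the whole pid space has
 * been walked once.  This is why the allocator loops instead of
 * simply taking the first free bit.
 */
static int alloc_pid_nr(void)
{
    int pid = last_pid + 1;
    int start;

    if (pid >= PID_MAX)
        pid = RESERVED_PIDS;
    start = pid;
    do {
        if (!test_and_set_pid(pid)) {
            last_pid = pid;
            return pid;
        }
        if (++pid >= PID_MAX)
            pid = RESERVED_PIDS;
    } while (pid != start);
    return -1;                  /* pid space exhausted */
}

static void free_pid_nr(int pid)
{
    pidmap[pid / 8] &= ~(1u << (pid % 8));
}
```

The point of the loop is visible in use: after allocating pids 1 and 2 and freeing 1, the next allocation hands out 3, not the just-freed 1.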
In addition we may have multiple pages of bitmap to traverse (when our pid limit is high) and those pages are not physically contiguous. So while I wouldn't call alloc_pidmap perfect it does seem to be reasonable. From what I can tell, for the low number of pids that we usually have, the pid hash table seems near optimal. If we do dig into this more we need to consider a radix_tree to hold the pid values. That could replace both the pid map and the hash table, gracefully handle both large and small pid counts, might be a smidgin simpler, possibly be more space efficient, and it would more easily handle multiple pid namespaces. The downside to using a radix tree is that it looks like it will have more cache misses for the normal pid map size, and it is yet another change that we would need to validate. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 14:12 ` Eric W. Biederman @ 2007-03-14 15:03 ` William Lee Irwin III 2007-03-14 16:54 ` Eric W. Biederman 2007-03-14 15:33 ` Oleg Nesterov 1 sibling, 1 reply; 14+ messages in thread From: William Lee Irwin III @ 2007-03-14 15:03 UTC (permalink / raw) To: Eric W. Biederman Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers On Wed, Mar 14, 2007 at 08:12:35AM -0600, Eric W. Biederman wrote: > If we do dig into this more we need to consider a radix_tree to hold > the pid values. That could replace both the pid map and the hash > table, gracefully handle both large and small pid counts, might > be a smidgin simpler, possibly be more space efficient, and it would > more easily handle multiple pid namespaces. The downside to using a > radix tree is that it looks like it will have more cache misses for > the normal pid map size, and it is yet another change that we would > need to validate. Radix trees' space behavior is extremely poor in sparsely-populated index spaces. There is no way they would save space or even come close to the current space footprint. Lock contention here would be a severe functional regression (note "functional," not "performance;" the lock contention surrounding these affairs takes down systems vs. mere slowdown nonsense), so it would necessarily depend on lockless radix tree code for correctness. The comment block describing the hashtable locking is stale and should have been updated in tandem with the RCU changes. -- wli ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 15:03 ` William Lee Irwin III @ 2007-03-14 16:54 ` Eric W. Biederman 2007-03-15 20:26 ` William Lee Irwin III 0 siblings, 1 reply; 14+ messages in thread From: Eric W. Biederman @ 2007-03-14 16:54 UTC (permalink / raw) To: William Lee Irwin III Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers William Lee Irwin III <wli@holomorphy.com> writes: > On Wed, Mar 14, 2007 at 08:12:35AM -0600, Eric W. Biederman wrote: >> If we do dig into this more we need to consider a radix_tree to hold >> the pid values. That could replace both the pid map and the hash >> table, gracefully handle both large and small pid counts, might >> be a smidgin simpler, possibly be more space efficient, and it would >> more easily handle multiple pid namespaces. The downside to using a >> radix tree is that it looks like it will have more cache misses for >> the normal pid map size, and it is yet another change that we would >> need to validate. > > Radix trees' space behavior is extremely poor in sparsely-populated > index spaces. There is no way they would save space or even come close > to the current space footprint. Possibly. We aren't that sparsely populated when it comes to pids. Hash tables aren't good at saving space either, and when they are space efficient they are on the edge of long hash chains so they are on the edge of performance problems. There is at least one variant of the fib tree that is as space efficient as any binary tree but works by looking at bits. Not that I think that one makes much sense. > Lock contention here would be a severe functional regression (note > "functional," not "performance;" the lock contention surrounding these > affairs takes down systems vs. mere slowdown nonsense), so it would > necessarily depend on lockless radix tree code for correctness. 
I don't know about the existing in-kernel implementations, but there is no reason we could not have an RCU-protected radix tree. At which point the challenges are about the same but we have indexes that would help us find the next free bit, which could reduce CPU time. The current pid implementation does not scale to larger process counts particularly well. The hash table size is fixed so we get a lot of hash collisions etc. Several other things go wonky as well. The one time I actually had 64K pids on a system (most of them zombies) it was not a very pleasant situation. It isn't common that we push the pid count up, and with the normal pid counts the current data structures seem very well suited to the problem. I have been looking at the data structures though in case it ever changes because the current good behavior seems quite fragile. Not that I am advocating changing anything yet, but I'm thinking about it so when we do come to the point where it matters we can make a reasonable change. > The comment block describing the hashtable locking is stale and should > have been updated in tandem with the RCU changes. Feel free to submit the patch. I didn't make the RCU changes, just took advantage of them. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 16:54 ` Eric W. Biederman @ 2007-03-15 20:26 ` William Lee Irwin III 2007-03-16 13:04 ` Eric W. Biederman 0 siblings, 1 reply; 14+ messages in thread From: William Lee Irwin III @ 2007-03-15 20:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers William Lee Irwin III <wli@holomorphy.com> writes: >> Radix trees' space behavior is extremely poor in sparsely-populated >> index spaces. There is no way they would save space or even come close >> to the current space footprint. On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: > Possibly. We aren't that sparsely populated when it comes to pids. > Hash tables aren't good at saving space either, and when they are space > efficient they are on the edge of long hash chains so they are on > the edge of performance problems. There is at least one variant of the > fib tree that is as space efficient as any binary tree but works by > looking at bits. Not that I think that one makes much sense. I'd not mind something better than a hashtable. The fib tree may make more sense than anticipated. It's truly better to switch data structures completely than fiddle with e.g. hashtable sizes. However, bear in mind the degenerate space behavior of radix trees in sparse contexts. William Lee Irwin III <wli@holomorphy.com> writes: >> Lock contention here would be a severe functional regression (note >> "functional," not "performance;" the lock contention surrounding these >> affairs takes down systems vs. mere slowdown nonsense), so it would >> necessarily depend on lockless radix tree code for correctness. On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: > I don't know about the existing in kernel implementations but there is no > reason we could not have an rcu protected radix tree. 
At which point the > challenges are about the same but we have indexes that would help us find > the next free bit, which could reduce CPU time. RCU'ing radix trees is trendy but the current implementation needs a spinlock where radix trees typically do not need one for RCU. I'm talking this over with others interested in lockless radix tree algorithms for reasons other than concurrency and/or parallelism. On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: > The current pid implementation does not scale to larger process counts > particularly well. The hash table size is fixed so we get a lot of hash > collisions etc. Several other things go wonky as well. The one time > I actually had 64K pids on a system (most of them zombies) it was not > a very pleasant situation. If you've got a microbenchmark that would be convenient for me to use while addressing it, unless you'd prefer to take it on yourself. Otherwise I can always write one myself. On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: > It isn't common that we push the pid count up, and with the normal pid > counts the current data structures seem very well suited to the > problem. I have been looking at the data structures though in case it > ever changes because the current good behavior seems quite fragile. > Not that I am advocating changing anything yet, but I'm thinking about > it so when we do come to the point where it matters we can make a > reasonable change. I'd say you already have enough evidence to motivate a change of data structure. Tree hashing (e.g. using balanced search trees in collision chains) is generally good for eliminating straight-line issues of this form but the available in-kernel tree structures don't do so well in concurrent/parallel contexts and the utility of hashing becomes somewhat questionable afterward given the stringent limits kernel environments impose on pid spaces. 
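The "indexes that would help us find the next free bit" idea mentioned above can be illustrated with a two-level toy in userspace C (not kernel code, and with none of the RCU/locking discussed here): a summary word records which leaf bitmap words are completely full, so finding a free id costs two bit scans instead of a linear walk over the whole bitmap. A radix tree over the pid bitmap generalizes this to more levels. `__builtin_ctzll` is a GCC/Clang builtin assumed here for the bit scan.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t leaf[64];  /* one bit per id; 64*64 = 4096 ids */
static uint64_t summary;   /* bit i set when leaf[i] is full */

static int alloc_id(void)
{
    int l, b;

    if (summary == ~0ULL)
        return -1;                      /* all 4096 ids in use */
    l = __builtin_ctzll(~summary);      /* first leaf word with room */
    b = __builtin_ctzll(~leaf[l]);      /* first free bit within it */
    leaf[l] |= 1ULL << b;
    if (leaf[l] == ~0ULL)
        summary |= 1ULL << l;           /* leaf just became full */
    return l * 64 + b;
}

static void free_id(int id)
{
    leaf[id / 64] &= ~(1ULL << (id % 64));
    summary &= ~(1ULL << (id / 64));    /* leaf has room again */
}
```

Note that, unlike the kernel's cyclic pid policy, this sketch hands a freed id back out immediately (it always takes the lowest free bit); it only illustrates how an index over the bitmap speeds up the search, not the reuse-avoidance policy.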
I favor Peter Zijlstra's B+ trees once a few relatively minor issues are addressed on account of good behavior in sparse keyspaces, though cleanups (not stylistic) of radix trees' space behavior may yet render them suitable, and may also be more popular to pursue. Basically all that's needed for radix trees' space behavior to get fixed up is proper path compression as opposed to the ->depth hack. Done properly it also eliminates the need for a spinlock around the whole radix tree for RCU. William Lee Irwin III <wli@holomorphy.com> writes: >> The comment block describing the hashtable locking is stale and should >> have been updated in tandem with the RCU changes. On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: > Feel free to submit the patch. I didn't make the RCU changes just took > advantage of them. I needed to note that because it and my description were in direct conflict. I'd rather leave it for a kernel janitor or someone who needs to get patches under their belt to sweep up as I've got enough laurels to rest on and other areas where I can get large amounts of code in easily enough, provided I get my act together on various fronts. -- wli ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-15 20:26 ` William Lee Irwin III @ 2007-03-16 13:04 ` Eric W. Biederman 2007-03-16 19:46 ` William Lee Irwin III 0 siblings, 1 reply; 14+ messages in thread From: Eric W. Biederman @ 2007-03-16 13:04 UTC (permalink / raw) To: William Lee Irwin III Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers William Lee Irwin III <wli@holomorphy.com> writes: > William Lee Irwin III <wli@holomorphy.com> writes: >>> Radix trees' space behavior is extremely poor in sparsely-populated >>> index spaces. There is no way they would save space or even come close >>> to the current space footprint. > > On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: >> Possibly. We aren't that sparsely populated when it comes to pids. >> Hash tables aren't good at saving space either, and when they are space >> efficient they are on the edge of long hash chains so they are on >> the edge of performance problems. There is at least one variant of the >> fib tree that is as space efficient as any binary tree but works by >> looking at bits. Not that I think that one makes much sense. > > I'd not mind something better than a hashtable. The fib tree may make > more sense than anticipated. It's truly better to switch data structures > completely than fiddle with e.g. hashtable sizes. However, bear in mind > the degenerate space behavior of radix trees in sparse contexts. Grr. s/fib tree/patricia tree/. We use that in the networking for the forwarding information base and I misremembered it. Anyway the interesting thing with the binary version of radix tree is that path compression is well defined. Path compression when you have multi-way branching is much more difficult. Sure. One of the reasons to be careful with switching data structures. Currently the hash tables typically operate at 10:1 unused:used entries. 4096 entries and 100 processes. 
The current work actually focuses on changing the code so we reduce the total number of hash table lookups, by persistently storing struct pid pointers instead of storing a pid_t. This has a lot of benefits when it comes to implementing a pid namespace but the secondary performance benefit is nice as well. My preliminary concern, though, was the increase in typical list traversal length during lookup. The current hash table typically does not have collisions so normally there is nothing to traverse. Meanwhile an RCU tree would tend towards 2-3 levels which would double or triple our number of cache line misses on lookup and thus reduce performance in the normal case. > William Lee Irwin III <wli@holomorphy.com> writes: >>> Lock contention here would be a severe functional regression (note >>> "functional," not "performance;" the lock contention surrounding these >>> affairs takes down systems vs. mere slowdown nonsense), so it would >>> necessarily depend on lockless radix tree code for correctness. > > On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: >> I don't know about the existing in-kernel implementations but there is no >> reason we could not have an RCU-protected radix tree. At which point the >> challenges are about the same but we have indexes that would help us find >> the next free bit, which could reduce CPU time. > > RCU'ing radix trees is trendy but the current implementation needs a > spinlock where radix trees typically do not need one for RCU. I'm > talking this over with others interested in lockless radix tree > algorithms for reasons other than concurrency and/or parallelism. Sure. I was thinking of the general class of data structures not the existing kernel one. The multi-way branching and lack of comparisons looks interesting as a hash table replacement. 
In a lot of cases storing hash values in a radix tree seems a very interesting alternative to a traditional hash table (and in that case the keys are randomly distributed and can easily be made dense). For pids hashing the values first seems like a waste, and > On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: >> The current pid implementation does not scale to larger process counts >> particularly well. The hash table size is fixed so we get a lot of hash >> collisions etc. Several other things go wonky as well. The one time >> I actually had 64K pids on a system (most of them zombies) it was not >> a very pleasant situation. > > If you've got a microbenchmark that would be convenient for me to use > while addressing it, unless you'd prefer to take it on yourself. > Otherwise I can always write one myself. I don't. I just have some observations from tracking a bug where we were creating unreapable zombies, and so the reproducer, despite trying to reap its zombies, created a lot of zombies. I recall that ps was noticeably slow. Beyond that I don't have good data. > On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: >> It isn't common that we push the pid count up, and with the normal pid >> counts the current data structures seem very well suited to the >> problem. I have been looking at the data structures though in case it >> ever changes because the current good behavior seems quite fragile. >> Not that I am advocating changing anything yet, but I'm thinking about >> it so when we do come to the point where it matters we can make a >> reasonable change. > > I'd say you already have enough evidence to motivate a change of data > structure. I have enough to know that a better data structure could improve things. Getting something that is clearly better than what we have now is difficult. Plus I have a lot of other patches to coordinate. 
I don't think the current data structure is going to fall over before I get the pid namespace implemented so that is my current priority. > Tree hashing (e.g. using balanced search trees in collision > chains) is generally good for eliminating straight-line issues of this > form but the available in-kernel tree structures don't do so well in > concurrent/parallel contexts and the utility of hashing becomes somewhat > questionable afterward given the stringent limits kernel environments > impose on pid spaces. I favor Peter Zijlstra's B+ trees once a few > relatively minor issues are addressed on account of good behavior in > sparse keyspaces, though cleanups (not stylistic) of radix trees' space > behavior may yet render them suitable, and may also be more popular to > pursue. > > Basically all that's needed for radix trees' space behavior to get > fixed up is proper path compression as opposed to the ->depth hack. > Done properly it also eliminates the need for a spinlock around the > whole radix tree for RCU. Yes. Although I have some doubts about path compression. A B+ tree would be a hair more expensive (as you have to binary search the keys before descending) and that places a greater emphasis on all of the keys fitting in one cache line. Still both data structures are interesting. My secondary use is that I need something with which I can do a full traversal for use in readdir. If we don't have a total ordering that is easy and safe to store in the file offset field, readdir can lose entries. Currently I use the pid value itself for this. Simply limiting tree height in the case of pids which are relatively dense is likely to be enough. > William Lee Irwin III <wli@holomorphy.com> writes: >>> The comment block describing the hashtable locking is stale and should >>> have been updated in tandem with the RCU changes. > > On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote: >> Feel free to submit the patch. 
I didn't make the RCU changes just took >> advantage of them. > > I needed to note that because it and my description were in direct > conflict. I'd rather leave it for a kernel janitor or someone who needs > to get patches under their belt to sweep up as I've got enough laurels > to rest on and other areas where I can get large amounts of code in > easily enough, provided I get my act together on various fronts. Makes sense. I still haven't seen that comment yet... Quite probably because I have a major blind spot for comments describing how the code works unless they are higher level comments. I just read the code. I think I resort to reading the comments only if I am confused as to what the code is doing or why it is doing what it is doing. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-16 13:04 ` Eric W. Biederman @ 2007-03-16 19:46 ` William Lee Irwin III 2007-03-16 21:18 ` Eric W. Biederman 0 siblings, 1 reply; 14+ messages in thread From: William Lee Irwin III @ 2007-03-16 19:46 UTC (permalink / raw) To: Eric W. Biederman Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers William Lee Irwin III <wli@holomorphy.com> writes: >> I'd not mind something better than a hashtable. The fib tree may make >> more sense than anticipated. It's truly better to switch data structures >> completely than fiddle with e.g. hashtable sizes. However, bear in mind >> the degenerate space behavior of radix trees in sparse contexts. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Grr. s/fib tree/patricia tree/. We use that in the networking for > the forwarding information base and I misremembered it. Anyway > the interesting thing with the binary version of radix tree is that > path compression is well defined. Path compression when you have > multi-way branching is much more difficult. Path compression isn't a big deal for multiway branching. I've usually done it by aligning nodes and or'ing the number of levels to skip into the lower bits of the pointer. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Sure. One of the reasons to be careful with switching data > structures. Currently the hash tables typically operate at 10:1 > unused:used entries. 4096 entries and 100 processes. That would be 40:1, which is "worse" in some senses. That's not going to fly well when pid namespaces proliferate. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > The current work actually focuses on changing the code so we reduce > the total number of hash table lookups, by persistently storing > struct pid pointers instead of storing a pid_t. 
This has a lot > of benefits when it comes to implementing a pid namespace but the > secondary performance benefit is nice as well. I can't quite make out what you mean by all that. struct pid is already what's in the hashtable. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Although my preliminary concern was the increase in typical list > traversal length during lookup. The current hash table typically does > not have collisions so normally there is nothing to traverse. Define "normally;" 10000 threads and/or processes can be standard for some affairs. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Meanwhile an rcu tree would tend towards 2-3 levels which would double > or triple our number of cache line misses on lookup and thus reduce > performance in the normal case. It depends on the sort of tree used. For B+ it depends on branching factors. For hash tries (not previously discussed) with path compression new levels would only be created when the maximum node size is exceeded. For radix trees (I'm thinking hand-coded) with path compression it depends on dispersion especially above the pseudo-levels used for the lowest bits. William Lee Irwin III <wli@holomorphy.com> writes: >> RCU'ing radix trees is trendy but the current implementation needs a >> spinlock where typically radix trees do not need them for RCU. I'm >> talking this over with others interested in lockless radix tree >> algorithms for reasons other than concurrency and/or parallelism. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Sure. I was thinking of the general class of data structures not the > existing kernel one. The multi-way branching and lack of comparisons > looks interesting as a hash table replacement. In a lot of cases > storing hash values in a radix tree seems a very interesting > alternative to a traditional hash table (and in that case the keys > are randomly distributed and can easily be made dense). 
For pids > hashing the values first seems like a waste, and Comparisons vs. no comparisons is less interesting than cachelines touched. B+ would either make RCU inconvenient or want comparisons by dereferencing, which raises the number of cachelines touched. William Lee Irwin III <wli@holomorphy.com> writes: >> I'd say you already have enough evidence to motivate a change of data >> structure. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > I have enough to know that a better data structure could improve > things. Getting something that is clearly better than what > we have now is difficult. Plus I have a lot of other patches to > coordinate. I don't think the current data structure is going to > fall over before I get the pid namespace implemented so that is > my current priority. I'll look into the data structure code, then. William Lee Irwin III <wli@holomorphy.com> writes: >> Basically all that's needed for radix trees' space behavior to get >> fixed up is proper path compression as opposed to the ->depth hack. >> Done properly it also eliminates the need for a spinlock around the >> whole radix tree for RCU. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > Yes. Although I have some doubts about path compression. A B+ tree > would be a hair more expensive (as you have to binary search the keys > before descending) and that places a greater emphasis on all of the > keys fitting in one cache line. Still both data structures are > interesting. Path compression is no big deal and resolves the radix tree space issues to my satisfaction. It's actually already in lib/radix-tree.c but in an ineffective form. It's more to do with numbers of cachelines touched. Having to fetch pid numbers from struct pid's is not going to be very good on that front. On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: > My secondary use is that I need something with which I can do a full > traversal for use in readdir. 
If we don't have a total ordering > that is easy and safe to store in the file offset field readdir can > lose entries. Currently I use the pid value itself for this. > Simply limiting tree height in the case of pids which are relatively > dense is likely to be enough. Manfred's readdir code should've dealt with that at least partially. B+ trees also resolve that quite well, despite their other issues. Sacrificing fully lockless RCU on B+ trees and wrapping writes with spinlocks should allow pid numbers to be stored alongside pointers in leaf nodes at the further cost of a branching factor reduction. TLB will at least be conserved even with a degraded branching factor. Anyway, I'm loath to use lib/radix-tree.c, but a different radix tree implementation I could run with. I've gone over some of the other alternatives. Give me an idea of where you want me to go with the data structure selection and I can sweep that up for you. I'm not attached to any particular one, though I am repelled from one in particular (which I can do anyway even if only for comparison purposes, though if it happens to be best I'll concede and let it through anyway). I can also defer the data structure switch to you if you really want to reserve that for yourself. -- wli ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-16 19:46 ` William Lee Irwin III @ 2007-03-16 21:18 ` Eric W. Biederman 0 siblings, 0 replies; 14+ messages in thread From: Eric W. Biederman @ 2007-03-16 21:18 UTC (permalink / raw) To: William Lee Irwin III Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Oleg Nesterov, Linux Containers William Lee Irwin III <wli@holomorphy.com> writes: > > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: >> Grr. s/fib tree/patricia tree/. We use that in the networking for >> the forwarding information base and I misremembered it. Anyway >> the interesting thing with the binary version of radix tree is that >> path compression is well defined. Path compression when you have >> multi-way branching is much more difficult. > > Path compression isn't a big deal for multiway branching. I've usually > done it by aligning nodes and or'ing the number of levels to skip into > the lower bits of the pointer. Hmm. I guess what I have seen is that it was simply more difficult because there were fewer opportunities the bigger the branching factor, but I haven't looked at it very closely. > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: >> Sure. One of the reasons to be careful with switching data >> structures. Currently the hash tables typically operate at 10:1 >> unused:used entries. 4096 entries and 100 processes. > > That would be 40:1, which is "worse" in some senses. That's not > going to fly well when pid namespaces proliferate. Agreed; currently the plan is to add a namespace parameter to hash table compares during lookups. Allocating hash tables at run time is almost impossible to do reliably because of the multiple page allocations. > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. 
Biederman wrote: >> The current work actually focuses on changing the code so we reduce >> the total number of hash table lookups, by persistently storing >> struct pid pointers instead of storing a pid_t. This has a lot >> of benefits when it comes to implementing a pid namespace but the >> secondary performance benefit is nice as well. > > I can't quite make out what you mean by all that. struct pid is already > what's in the hashtable. Yes. But I have given it a life of its own as well. Which means instead of caching a pid_t value in a long-lived data structure we can hold a struct pid *. So that means we have fewer total hash table lookups. > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: >> Although my preliminary concern was the increase in typical list >> traversal length during lookup. The current hash table typically does >> not have collisions so normally there is nothing to traverse. > > Define "normally;" 10000 threads and/or processes can be standard for > some affairs. When I did a quick survey of systems I could easily find, everything was much lower than that. I wasn't able to find those setups in my quick survey. I was looking for systems with long hash chains to justify a data structure switch, especially systems that needed to push up the default pid limit, but I didn't encounter them. So that said to me the common case was well handled by the current setup. Especially where even at 10000 we only have normal hash chain lengths of 3 to 4 (3 to 5?). I did a little modeling and our hash function was good enough that it generally gave a good distribution of pid values across the buckets. My memory is something like the really nasty cases only occur when we start pushing /proc/sys/kernel/pid_max above its default of 32768. Our worst case has pid hash chains of 1k entries which clearly sucks. 
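The modeling Eric describes can be reproduced in a few lines of C. The kernel's pid hash at the time was built on multiplicative (golden-ratio) hashing via hash_long(); this sketch assumes the 32-bit variant with its 0x9e370001 constant and a 4096-bucket table, and simply measures the longest chain when consecutive pids are inserted.

```c
#include <assert.h>
#include <string.h>

#define PIDHASH_BITS 12                     /* 4096 buckets */
#define GOLDEN_RATIO_PRIME_32 0x9e370001u   /* 32-bit hash_long() constant */

static unsigned pid_hashfn(unsigned pid)
{
    /* Multiplicative hashing: keep the top PIDHASH_BITS of the product. */
    return (pid * GOLDEN_RATIO_PRIME_32) >> (32 - PIDHASH_BITS);
}

/*
 * Longest chain when `nr` consecutive pids (the common pattern on a
 * freshly booted box) are spread across the table.
 */
static int max_chain(unsigned nr)
{
    int bucket[1 << PIDHASH_BITS];
    int worst = 0;
    unsigned pid;

    memset(bucket, 0, sizeof(bucket));
    for (pid = 1; pid <= nr; pid++) {
        unsigned h = pid_hashfn(pid);
        if (++bucket[h] > worst)
            worst = bucket[h];
    }
    return worst;
}
```

With 100 processes virtually every pid gets its own bucket, and even at 10000 pids the chains stay near the pigeonhole minimum of 3, which matches the "3 to 4" chain lengths recalled above.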
> William Lee Irwin III <wli@holomorphy.com> writes: >>> RCU'ing radix trees is trendy but the current implementation needs a >>> spinlock where typically radix trees do not need them for RCU. I'm >>> talking this over with others interested in lockless radix tree >>> algorithms for reasons other than concurrency and/or parallelism. > > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: >> Sure. I was thinking of the general class of data structures not the >> existing kernel one. The multi-way branching and lack of comparisons >> looks interesting as a hash table replacement. In a lot of cases >> storing hash values in a radix tree seems a very interesting >> alternative to a traditional hash table (and in that case the keys >> are randomly distributed and can easily be made dense). For pids >> hashing the values first seems like a waste, and > > Comparisons vs. no comparisons is less interesting than cachelines > touched. B+ would either make RCU inconvenient or want comparisons by > dereferencing, which raises the number of cachelines touched. Agreed. Hmm. I didn't say that too well, I was thinking of the lack of comparisons implying fewer cache line touches. > William Lee Irwin III <wli@holomorphy.com> writes: >>> I'd say you already have enough evidence to motivate a change of data >>> structure. > > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote: >> I have enough to know that a better data structure could improve >> things. Getting something that is clearly better than what >> we have now is difficult. Plus I have a lot of other patches to >> coordinate. I don't think the current data structure is going to >> fall over before I get the pid namespace implemented so that is >> my current priority. > > I'll look into the data structure code, then. Thanks. > > On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. 
Biederman wrote: >> My secondary use is that I need something for which I can do a full >> traversal with for use in readdir. If we don't have a total ordering >> that is easy and safe to store in the file offset field readdir can >> loose entries. Currently I use the pid value itself for this. >> Simply limiting tree height in the case of pids which are relatively >> dense is likely to be enough. > > Manfred's readdir code should've dealt with that at least partially. Partially sounds correct. I had to replace it recently because it was possible for readdir to skip a process that existed for the entire length of an opendir, readdir_loop, closedir session. It took a process exiting to trigger it, so it was rare, but the semantics were impossible for user space to work around. The problem is that if the process we stopped on disappears, things must be well ordered enough that we can find the next succeeding process. The previous code (which I assume Manfred wrote, from skimming the changelog) would simply count forward a fixed number of processes in the process list (when the process it had stopped on died). Since we always append to the end of the task list, that would skip processes immediately after those that had exited. > B+ trees also resolve that quite well, despite their other issues. > Sacrificing fully lockless RCU on B+ trees and wrapping writes with > spinlocks should allow pid numbers to be stored alongside pointers > in leaf nodes at the further cost of a branching factor reduction. > TLB will at least be conserved even with a degraded branching factor. > > Anyway, I'm loath to use lib/radix-tree.c but a different radix tree > implementation I could run with. I've gone over some of the other > alternatives. Give me an idea of where you want me to go with the data > structure selection and I can sweep that up for you.
I'm not attached > to any particular one, though I am repelled from one in particular > (which I can do anyway even if only for comparison purposes, though if > it happens to be best I'll concede and let it through anyway). > > I can also defer the data structure switch to you if you really want > to reserve that for yourself. No. If someone else is interested and can do the work I don't want to reserve it for myself. As long as I get to help review the changes I don't have any real preference for data structures, as long as it meets the needs of the pid lookup. I should mention there is also a subtle issue at the leaf nodes. The pid namespaces will be hierarchical, with parents fully containing their children. Each struct pid will appear in its current pid namespace and all parent pid namespaces (or else we can't use traditional unix process control functions (like kill, wait, and ptrace)). Each struct pid will be assigned a different pid_t value in each pid namespace. When we want the pid value of a process that isn't current there will be a function pid_nr() that looks at our current pid namespace and finds the pid_t value that corresponds to that pid namespace. In general I don't expect the hierarchy of pid namespaces to be very deep: 1 for the traditional case and 2 or 3 when we are actually taking advantage of the pid namespaces, so we might be able to optimize taking that into account, as long as the interfaces don't bake in that assumption. That observation does mean pid_nr() can easily be a walk through all of the possibilities.
The implication there is that we might end up with:

	struct pid_lookup_entry {
		pid_t nr;
		struct list_head lookup_list;
		struct pid_namespace *ns;
		struct pid *pid;
		struct list_head pid_list;
	};

	struct pid {
		atomic_t count;
		struct list_head pid_list;
		/* lists of tasks that use this pid */
		struct hlist_head tasks[PIDTYPE_MAX];
		struct rcu_head rcu;
	};

Alternatively it might just be:

	struct pid {
		atomic_t count;
		/* lists of tasks that use this pid */
		struct hlist_head tasks[PIDTYPE_MAX];
		struct rcu_head rcu;
		struct {
			struct pid_namespace *ns;
			pid_t nr;
		} pids[PID_MAX_DEPTH];
	};

There are a lot of variations on that theme and it really depends on the upper level data structures which one we pick. Eric ^ permalink raw reply [flat|nested] 14+ messages in thread
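To make the leaf-node issue concrete: with the second layout, pid_nr() reduces to a short walk over pids[]. The sketch below follows the proposal above, but every name in it (the level field, PID_MAX_DEPTH, pid_nr_ns()) is an assumption for illustration, not merged kernel code:

```c
#include <assert.h>

#define PID_MAX_DEPTH 4	/* assumption: 1 level normally, 2-3 with namespaces */

struct pid_namespace { int unused; };

struct pid {
	int level;	/* depth of the namespace this pid was created in */
	struct {
		struct pid_namespace *ns;
		int nr;
	} pids[PID_MAX_DEPTH];
};

/* Return the pid_t value of pid as seen from namespace ns, or 0 if pid
 * is not visible there (i.e. ns is a sibling or descendant namespace).
 * Since parents fully contain children, a linear walk over at most
 * level + 1 entries is enough. */
static int pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
{
	int i;

	for (i = 0; i <= pid->level; i++)
		if (pid->pids[i].ns == ns)
			return pid->pids[i].nr;
	return 0;
}
```

With the hierarchy expected to be only 1-3 deep, the walk is effectively constant time, which matches the observation that pid_nr() can just try all of the possibilities.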
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 14:12 ` Eric W. Biederman 2007-03-14 15:03 ` William Lee Irwin III @ 2007-03-14 15:33 ` Oleg Nesterov 2007-03-16 10:57 ` Pavel Emelianov 1 sibling, 1 reply; 14+ messages in thread From: Oleg Nesterov @ 2007-03-14 15:33 UTC (permalink / raw) To: Eric W. Biederman Cc: Pavel Emelianov, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Linux Containers On 03/14, Eric W. Biederman wrote: > Pavel Emelianov <xemul@sw.ru> writes: > > > Hi. > > > > I'm looking at how alloc_pid() works and can't understand > > one (simple/stupid) thing. > > > > It first kmem_cache_alloc()-s a strct pid, then calls > > alloc_pidmap() and at the end it taks a global pidmap_lock() > > to add new pid to hash. We need some global lock. pidmap_lock is already here, and it is only used to protect pidmap->page allocation. Iow, it is almost unused. So it was very natural to re-use it while implementing pidrefs. > > The question is - why does alloc_pidmap() use at least > > two atomic ops and potentially loop to find a zero bit > > in pidmap? Why not call alloc_pidmap() under pidmap_lock > > and find zero pid in pidmap w/o any loops and atomics? Currently we search for zero bit lockless, why do you want to do it under spin_lock ? Oleg. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 15:33 ` Oleg Nesterov @ 2007-03-16 10:57 ` Pavel Emelianov 2007-03-16 11:37 ` Eric Dumazet 2007-03-16 11:40 ` Dmitry Adamushko 0 siblings, 2 replies; 14+ messages in thread From: Pavel Emelianov @ 2007-03-16 10:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Linux Containers Oleg Nesterov wrote: > On 03/14, Eric W. Biederman wrote: >> Pavel Emelianov <xemul@sw.ru> writes: >> >>> Hi. >>> >>> I'm looking at how alloc_pid() works and can't understand >>> one (simple/stupid) thing. >>> >>> It first kmem_cache_alloc()-s a strct pid, then calls >>> alloc_pidmap() and at the end it taks a global pidmap_lock() >>> to add new pid to hash. > > We need some global lock. pidmap_lock is already here, and it is > only used to protect pidmap->page allocation. Iow, it is almost > unused. So it was very natural to re-use it while implementing > pidrefs. > >>> The question is - why does alloc_pidmap() use at least >>> two atomic ops and potentially loop to find a zero bit >>> in pidmap? Why not call alloc_pidmap() under pidmap_lock >>> and find zero pid in pidmap w/o any loops and atomics? > > Currently we search for zero bit lockless, why do you want > to do it under spin_lock ? Search isn't lockless. Look:

	while (1) {
		if (!test_and_set_bit(...)) {
			atomic_dec(&nr_free);
			return pid;
		}
		...
	}

we use two atomic operations to find and set a bit in a map. > Oleg. > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-16 10:57 ` Pavel Emelianov @ 2007-03-16 11:37 ` Eric Dumazet 2007-03-16 11:58 ` Pavel Emelianov 2007-03-16 11:40 ` Dmitry Adamushko 1 sibling, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2007-03-16 11:37 UTC (permalink / raw) To: Pavel Emelianov Cc: Oleg Nesterov, Eric W. Biederman, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Linux Containers On Friday 16 March 2007 11:57, Pavel Emelianov wrote: > Oleg Nesterov wrote: > > On 03/14, Eric W. Biederman wrote: > >> Pavel Emelianov <xemul@sw.ru> writes: > >>> Hi. > >>> > >>> I'm looking at how alloc_pid() works and can't understand > >>> one (simple/stupid) thing. > >>> > >>> It first kmem_cache_alloc()-s a strct pid, then calls > >>> alloc_pidmap() and at the end it taks a global pidmap_lock() > >>> to add new pid to hash. > > > > We need some global lock. pidmap_lock is already here, and it is > > only used to protect pidmap->page allocation. Iow, it is almost > > unused. So it was very natural to re-use it while implementing > > pidrefs. > > > >>> The question is - why does alloc_pidmap() use at least > >>> two atomic ops and potentially loop to find a zero bit > >>> in pidmap? Why not call alloc_pidmap() under pidmap_lock > >>> and find zero pid in pidmap w/o any loops and atomics? > > > > Currently we search for zero bit lockless, why do you want > > to do it under spin_lock ? > > Search isn't lockless. Look: > > while (1) { > if (!test_and_set_bit(...)) { > atomic_dec(&nr_free); > return pid; > } > ... > } > > we use two atomic operations to find and set a bit in a map. The finding of the zero bit is done without lock. (Search/lookup) Then , the reservation of the found bit (test_and_set_bit) is done, and decrement of nr_free. It may fail because the search was done lockless. Finding a zero bit in a 4096 bytes array may consume about 6000 cycles on modern hardware. 
Much more on SMP/NUMA machines, or on machines where PAGE_SIZE is 64K instead of 4K :) You don't want to hold pidmap_lock for such a long period. ^ permalink raw reply [flat|nested] 14+ messages in thread
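The split being described here — an optimistic lockless scan, then an atomic claim that can fail and force the scan to resume — can be modeled in userspace with C11 atomics. This is a sketch, not the kernel code: test_and_set() stands in for test_and_set_bit(), and the wraparound/last_pid bookkeeping of the real allocator is omitted:

```c
#include <assert.h>
#include <stdatomic.h>

#define MAP_BITS  256
#define WORD_BITS 32

static _Atomic unsigned int map[MAP_BITS / WORD_BITS];

/* Lockless probe: find the first clear bit at or after off.
 * The answer may already be stale by the time we act on it. */
static int find_next_zero(int off)
{
	for (; off < MAP_BITS; off++)
		if (!((atomic_load(&map[off / WORD_BITS]) >> (off % WORD_BITS)) & 1))
			return off;
	return -1;
}

/* The moral equivalent of test_and_set_bit(): returns 0 if we won the
 * bit, nonzero if another thread claimed it between probe and claim. */
static unsigned int test_and_set(int bit)
{
	unsigned int mask = 1u << (bit % WORD_BITS);

	return atomic_fetch_or(&map[bit / WORD_BITS], mask) & mask;
}

/* Allocate a free bit: scan locklessly, then try to claim.  Losing a
 * race just means resuming the scan -- no lock is ever held across the
 * O(bits) search, which is the point of the kernel's layout. */
static int alloc_bit(void)
{
	int bit = 0;

	while ((bit = find_next_zero(bit)) >= 0) {
		if (!test_and_set(bit))
			return bit;
		bit++;	/* lost the race; keep scanning */
	}
	return -1;
}
```

So the two atomic operations Pavel counted buy the property Eric Dumazet describes: the expensive search runs with no lock held, and contention costs only a retry.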
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-16 11:37 ` Eric Dumazet @ 2007-03-16 11:58 ` Pavel Emelianov 0 siblings, 0 replies; 14+ messages in thread From: Pavel Emelianov @ 2007-03-16 11:58 UTC (permalink / raw) To: Eric Dumazet Cc: Oleg Nesterov, Eric W. Biederman, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List, Linux Containers Eric Dumazet wrote: > On Friday 16 March 2007 11:57, Pavel Emelianov wrote: >> Oleg Nesterov wrote: >>> On 03/14, Eric W. Biederman wrote: >>>> Pavel Emelianov <xemul@sw.ru> writes: >>>>> Hi. >>>>> >>>>> I'm looking at how alloc_pid() works and can't understand >>>>> one (simple/stupid) thing. >>>>> >>>>> It first kmem_cache_alloc()-s a strct pid, then calls >>>>> alloc_pidmap() and at the end it taks a global pidmap_lock() >>>>> to add new pid to hash. >>> We need some global lock. pidmap_lock is already here, and it is >>> only used to protect pidmap->page allocation. Iow, it is almost >>> unused. So it was very natural to re-use it while implementing >>> pidrefs. >>> >>>>> The question is - why does alloc_pidmap() use at least >>>>> two atomic ops and potentially loop to find a zero bit >>>>> in pidmap? Why not call alloc_pidmap() under pidmap_lock >>>>> and find zero pid in pidmap w/o any loops and atomics? >>> Currently we search for zero bit lockless, why do you want >>> to do it under spin_lock ? >> Search isn't lockless. Look: >> >> while (1) { >> if (!test_and_set_bit(...)) { >> atomic_dec(&nr_free); >> return pid; >> } >> ... >> } >> >> we use two atomic operations to find and set a bit in a map. > > The finding of the zero bit is done without lock. (Search/lookup) > > Then , the reservation of the found bit (test_and_set_bit) is done, and > decrement of nr_free. It may fail because the search was done lockless. :\ I do understand how this algorithm works. What I don't understand is why it is done so, if we take a global lock anyway. 
> Finding a zero bit in a 4096 bytes array may consume about 6000 cycles on > modern hardware. Much more on SMP/NUMA machines, or on machines where > PAGE_SIZE is 64K instead of 4K :) > > You don't want to hold pidmad_lock for so long period. OK, thanks. That explanation looks good. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-16 10:57 ` Pavel Emelianov 2007-03-16 11:37 ` Eric Dumazet @ 2007-03-16 11:40 ` Dmitry Adamushko 1 sibling, 0 replies; 14+ messages in thread From: Dmitry Adamushko @ 2007-03-16 11:40 UTC (permalink / raw) To: Pavel Emelianov; +Cc: Linux Kernel On 16/03/07, Pavel Emelianov <xemul@sw.ru> wrote: > Oleg Nesterov wrote: > > On 03/14, Eric W. Biederman wrote: > >> Pavel Emelianov <xemul@sw.ru> writes: > >> > >>> Hi. > >>> > >>> I'm looking at how alloc_pid() works and can't understand > >>> one (simple/stupid) thing. > >>> > >>> It first kmem_cache_alloc()-s a strct pid, then calls > >>> alloc_pidmap() and at the end it taks a global pidmap_lock() > >>> to add new pid to hash. > > > > We need some global lock. pidmap_lock is already here, and it is > > only used to protect pidmap->page allocation. Iow, it is almost > > unused. So it was very natural to re-use it while implementing > > pidrefs. > > > >>> The question is - why does alloc_pidmap() use at least > >>> two atomic ops and potentially loop to find a zero bit > >>> in pidmap? Why not call alloc_pidmap() under pidmap_lock > >>> and find zero pid in pidmap w/o any loops and atomics? > > > > Currently we search for zero bit lockless, why do you want > > to do it under spin_lock ? > > Search isn't lockless. Look: > > while (1) { > if (!test_and_set_bit(...)) { > atomic_dec(&nr_free); > return pid; > } > we use two atomic operations to find and set a bit in a map. While you may have a few concurrent threads competing for the same "offset" and "pid" in the loop - e.g. at point [1] (see below), only one will succeed with "registering" it due to the atomicity of test_and_set_bit() and so only this one will get at point [2] with the "pid". The rest of the "unlucky" threads will either (i) compete for another "offset" -> "pid" (as described above); (ii) leave the loop when one of the conditions of while() becomes "false" -> e.g. 
there are no more free slots in this map.

	if (likely(atomic_read(&map->nr_free))) {
		do {
			// [1]
			if (!test_and_set_bit(offset, map->page)) {
				// [2]
				atomic_dec(&map->nr_free);
				pid_ns->last_pid = pid;
				return pid;
			}
			offset = find_next_offset(map, offset);
			pid = mk_pid(pid_ns, map, offset);
			/*
			 * find_next_offset() found a bit, the pid from it
			 * is in-bounds, and if we fell back to the last
			 * bitmap block and the final block was the same
			 * as the starting point, pid is before last_pid.
			 */
		} while (offset < BITS_PER_PAGE && pid < pid_max &&
			 (i != max_scan || pid < last ||
			  !((last+1) & BITS_PER_PAGE_MASK)));
	}

> ... > > Oleg. > > -- Best regards, Dmitry Adamushko ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC] kernel/pid.c pid allocation wierdness 2007-03-14 7:30 [RFC] kernel/pid.c pid allocation wierdness Pavel Emelianov 2007-03-14 14:12 ` Eric W. Biederman @ 2007-03-14 14:43 ` William Lee Irwin III 1 sibling, 0 replies; 14+ messages in thread From: William Lee Irwin III @ 2007-03-14 14:43 UTC (permalink / raw) To: Pavel Emelianov Cc: Eric W. Biederman, Sukadev Bhattiprolu, Serge Hallyn, Linux Kernel Mailing List On Wed, Mar 14, 2007 at 10:30:59AM +0300, Pavel Emelianov wrote: > I'm looking at how alloc_pid() works and can't understand > one (simple/stupid) thing. > It first kmem_cache_alloc()-s a strct pid, then calls > alloc_pidmap() and at the end it taks a global pidmap_lock() > to add new pid to hash. > The question is - why does alloc_pidmap() use at least > two atomic ops and potentially loop to find a zero bit > in pidmap? Why not call alloc_pidmap() under pidmap_lock > and find zero pid in pidmap w/o any loops and atomics? > The same is for free_pid(). Do I miss something? pidmap_lock protects the ->page elements of the pidmap array. The bitmap is not protected by it. It was intended to be as lockless as possible, so the lock there essentially stands in for a cmpxchg(). A loop of some kind is strictly necessary regardless; in this lockless case concurrent bitmap updates can trigger looping. It's very important that only O(1) operations happen under the lock. These operations are installing freshly-allocated pidmap pages, inserting a struct pid into a hashtable collision chain, and removing a struct pid from a hashtable collision chain. Traversals of hashtable collision chains are lockless as per RCU. In any event, the atomic bit operations allow purely lockless bitmap updates as well as purely lockless bitmap reads. Essentially the idioms you're noticing are all for SMP scalability; in particular, pid allocation used to cause enormous stress on tasklist_lock that would trigger NMI-based deadlock detectors. 
Backing out such optimizations is tantamount to making the systems affected by that (i.e. any with enough cpus) crash that way again. -- wli ^ permalink raw reply [flat|nested] 14+ messages in thread
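The O(1)-under-the-lock discipline described above is also visible in how alloc_pidmap() installs a fresh bitmap page: allocate with the lock dropped (allocation can sleep), take the lock only for the pointer install, and throw the page away if another cpu won the race. A userspace sketch of that pattern — an atomic_flag spinlock stands in for pidmap_lock, and calloc() for the kernel page allocator:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

struct pidmap { void *page; };

static atomic_flag pidmap_lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)   { while (atomic_flag_test_and_set(&pidmap_lock)) /* spin */; }
static void spin_unlock(void) { atomic_flag_clear(&pidmap_lock); }

/* Install a bitmap page the way alloc_pidmap() does: the allocation
 * happens outside the lock, and only the O(1) pointer install (plus
 * cleanup on a lost race) happens inside it. */
static int install_page(struct pidmap *map)
{
	void *page;

	if (map->page)			/* lockless fast path: already there */
		return 0;
	page = calloc(1, PAGE_SIZE);	/* may block; lock is not held */
	if (!page)
		return -1;
	spin_lock();
	if (map->page)
		free(page);		/* someone else installed first */
	else
		map->page = page;
	spin_unlock();
	return 0;
}
```

Two racing callers both reach the install, but only one pointer survives and the loser's page is freed, so the critical section never grows beyond a compare, a store, and possibly a free.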
end of thread, other threads:[~2007-03-16 21:19 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-03-14 7:30 [RFC] kernel/pid.c pid allocation wierdness Pavel Emelianov 2007-03-14 14:12 ` Eric W. Biederman 2007-03-14 15:03 ` William Lee Irwin III 2007-03-14 16:54 ` Eric W. Biederman 2007-03-15 20:26 ` William Lee Irwin III 2007-03-16 13:04 ` Eric W. Biederman 2007-03-16 19:46 ` William Lee Irwin III 2007-03-16 21:18 ` Eric W. Biederman 2007-03-14 15:33 ` Oleg Nesterov 2007-03-16 10:57 ` Pavel Emelianov 2007-03-16 11:37 ` Eric Dumazet 2007-03-16 11:58 ` Pavel Emelianov 2007-03-16 11:40 ` Dmitry Adamushko 2007-03-14 14:43 ` William Lee Irwin III