* getting processor numbers
@ 2007-04-03 16:54 Ulrich Drepper
2007-04-03 17:30 ` linux-os (Dick Johnson)
` (4 more replies)
0 siblings, 5 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 16:54 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 1710 bytes --]
More and more code depends on knowing the number of processors in the
system to efficiently scale the code. E.g., in OpenMP it is used by
default to determine how many threads to create. Creating more threads
than there are processors/cores doesn't make sense.
glibc has for a long time provided functionality to retrieve the number
through sysconf(), and this is fortunately what most programs use. The
problem is that we are currently using /proc/cpuinfo, since this is all
that was available at the time. Creating /proc/cpuinfo unfortunately
takes the kernel quite a long time (I think Jakub said it is mainly
the interrupt information).
The alternative today is to use /sys/devices/system/cpu and count the
number of cpu* directories in it. This is somewhat faster. But there
would be another possibility: simply stat /sys/devices/system/cpu and
use st_nlink - 2.
This last step is unfortunately made impossible by recent changes:

http://article.gmane.org/gmane.linux.kernel/413178
I would like to propose changing that patch, moving the sched_*
pseudo-files into some other directory, and permanently banning any new
files in /sys/devices/system/cpu.
To get some numbers, you can try
http://people.redhat.com/drepper/nproc-timing.c
The numbers I see on x86-64:

  cpuinfo       10145810 cycles for 100 accesses
  readdir /sys   3113870 cycles for 100 accesses
  stat /sys       741070 cycles for 100 accesses

Note that for the first two methods I skipped the actual parsing part.
This means that in a real solution the gap between those two and the
simple stat() call is even bigger.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
^ permalink raw reply [flat|nested] 56+ messages in thread

* Re: getting processor numbers
  2007-04-03 16:54 getting processor numbers Ulrich Drepper
@ 2007-04-03 17:30 ` linux-os (Dick Johnson)
  2007-04-03 17:37   ` Ulrich Drepper
  2007-04-03 17:56 ` Dr. David Alan Gilbert
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: linux-os (Dick Johnson) @ 2007-04-03 17:30 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel, Andrew Morton

On Tue, 3 Apr 2007, Ulrich Drepper wrote:

> More and more code depends on knowing the number of processors in the
> system to efficiently scale the code.
[...]
> I would like to propose changing that patch, moving the sched_*
> pseudo-files into some other directory, and permanently banning any new
> files in /sys/devices/system/cpu.
[...]
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

Shouldn't it just be another system call? 223 is currently unused. You
could fill that up with __NR_nr_cpus. The value already exists in
the kernel.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.65 BogoMips).
New book: http://www.AbominableFirebug.com/

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:30 ` linux-os (Dick Johnson)
@ 2007-04-03 17:37   ` Ulrich Drepper
  0 siblings, 0 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 17:37 UTC (permalink / raw)
  To: linux-os (Dick Johnson); +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 540 bytes --]

linux-os (Dick Johnson) wrote:
> Shouldn't it just be another system call? 223 is currently unused. You
> could fill that up with __NR_nr_cpus. The value already exists in
> the kernel.

You forget about Linus' credo "there shall be no sysconf-like syscall".
I'd be all for sys_sysconf, or even the limited sys_nr_cpus, although
ideally we'd then have two syscalls (probed CPUs, active CPUs), in which
case sys_sysconf is the better choice.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 16:54 getting processor numbers Ulrich Drepper
  2007-04-03 17:30 ` linux-os (Dick Johnson)
@ 2007-04-03 17:56 ` Dr. David Alan Gilbert
  2007-04-03 18:11 ` Andi Kleen
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Dr. David Alan Gilbert @ 2007-04-03 17:56 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel, Andrew Morton

* Ulrich Drepper (drepper@redhat.com) wrote:
> glibc for a long time provides functionality to retrieve the number
> through sysconf() and this is what fortunately most programs use. The
> problem is that we are currently using /proc/cpuinfo since this is all
> there was available at that time. Creating /proc/cpuinfo takes the
> kernel quite a long time, unfortunately (I think Jakub said it is mainly
> the interrupt information).

It's not only expensive to create, it's expensive and annoying to parse;
I don't think it is even vaguely consistent across different
architectures, not to mention kernel versions.

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert  | Running GNU/Linux on Alpha,68K | Happy  \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org  |_______/

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 16:54 getting processor numbers Ulrich Drepper
  2007-04-03 17:30 ` linux-os (Dick Johnson)
  2007-04-03 17:56 ` Dr. David Alan Gilbert
@ 2007-04-03 18:11 ` Andi Kleen
  2007-04-03 17:17   ` Ulrich Drepper
  2007-04-03 19:15 ` Davide Libenzi
  2007-04-03 20:16 ` Andrew Morton
  4 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 18:11 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel, Andrew Morton

Ulrich Drepper <drepper@redhat.com> writes:

> More and more code depends on knowing the number of processors in the
> system to efficiently scale the code. E.g., in OpenMP it is used by
> default to determine how many threads to create.

There are more uses for it.

> Creating more threads
> than there are processors/cores doesn't make sense.

There was a proposal some time ago to put that into the ELF aux vector.
Unfortunately there was disagreement on what information to put there
exactly (full topology, only limited numbers, etc.). My proposal was
number of CPUs, number of cores, number of nodes as three 16 bit
numbers.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 18:11 ` Andi Kleen
@ 2007-04-03 17:17   ` Ulrich Drepper
  2007-04-03 17:22     ` Alan Cox
  2007-04-03 17:27     ` Andi Kleen
  0 siblings, 2 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 17:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 661 bytes --]

Andi Kleen wrote:
> There was a proposal some time ago to put that into the ELF aux vector
> Unfortunately there was disagreement on what information to put
> there exactly (full topology, only limited numbers etc.)

Topology, yes, I'm likely in favor of it.

Processor number: no. Unless you want to rip out hotplugging. I'm
certainly in favor of that; it creates huge problems for no real
benefit for the common use cases. But as it is, the number of
processors is not necessarily constant over the lifetime of a process.
The machine architecture is.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:17 ` Ulrich Drepper
@ 2007-04-03 17:22   ` Alan Cox
  2007-04-03 17:30     ` Andi Kleen
  2007-04-03 17:27 ` Andi Kleen
  1 sibling, 1 reply; 56+ messages in thread
From: Alan Cox @ 2007-04-03 17:22 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

> benefit for the common use cases. But as it is, the number of
> processors is not necessarily constant over the lifetime of a process.
> The machine architecture is.

Not once you have migration-capable virtualisation it isn't.

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:22 ` Alan Cox
@ 2007-04-03 17:30   ` Andi Kleen
  2007-04-03 20:24     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 17:30 UTC (permalink / raw)
  To: Alan Cox; +Cc: Ulrich Drepper, Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 06:22:41PM +0100, Alan Cox wrote:
> > benefit for the common use cases. But as it is, the number of
> > processors is not necessarily constant over the lifetime of a process.
> > The machine architecture is.
>
> Not once you have migration capable virtualisation it isnt.

Migration is fundamentally incompatible with many CPU optimizations.
But that's not a reason to not optimize anymore.

But I guess luckily most migration users will be able to live
with a little decreased performance after it.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:30 ` Andi Kleen
@ 2007-04-03 20:24   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 56+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-03 20:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, Ulrich Drepper, Linux Kernel, Andrew Morton

Andi Kleen wrote:
> Migration is fundamentally incompatible with many CPU optimizations.
> But that's not a reason to not optimize anymore.

I've been thinking about ways in which Xen could provide the current
vcpu->cpu map to guest domains. Obviously this would change over time,
but it could remain current enough to be useful.

> But I guess luckily most migration users will be able to live
> with a little decreased performance after it.

At least in the Xen case, the source and target machines need to be
fairly similar in architecture (it can't deal with vastly different
CPU types, for example).

    J

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:17 ` Ulrich Drepper
  2007-04-03 17:22   ` Alan Cox
@ 2007-04-03 17:27 ` Andi Kleen
  2007-04-03 17:30   ` Ulrich Drepper
  1 sibling, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 17:27 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 10:17:15AM -0700, Ulrich Drepper wrote:
> Andi Kleen wrote:
> > There was a proposal some time ago to put that into the ELF aux vector
> > Unfortunately there was disagreement on what information to put
> > there exactly (full topology, only limited numbers etc.)
>
> Topology, yes, I'm likely in favor of it.

What topology, and what use case?

> Processor number: no. Unless you want to rip out hotpluging. I'm

Topology is dependent on the number of CPUs.

Hot plugging is a completely orthogonal problem. Even your original
proposal wouldn't address it. Mine doesn't either, because I suspect
most programs won't care. If it's addressed it could work on top of
it: the aux vector to get the information quickly at program startup,
and later updates can get it from /sys.

If some program starts caring we would need to implement some
notification mechanism (that would be possible), but it might be hard
to fit into glibc because you don't have an event loop.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:27 ` Andi Kleen
@ 2007-04-03 17:30   ` Ulrich Drepper
  2007-04-03 17:35     ` Andi Kleen
  2007-04-03 17:44     ` Siddha, Suresh B
  0 siblings, 2 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 17:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 544 bytes --]

Andi Kleen wrote:
> Topology is dependent on the number of CPUs.

Not all of it.

> Hot plugging is a completely orthogonal problem. Even your original
> proposal wouldn't address it.

Nonsense. Reading /proc/cpuinfo or /sys/devices/system/cpu reflects
the current CPU count. The information read is not (and would not be)
cached; it's re-read every time. We might add very limited caching
(for a few seconds) but that's as far as we can go.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:30 ` Ulrich Drepper
@ 2007-04-03 17:35   ` Andi Kleen
  2007-04-03 17:45     ` Ulrich Drepper
  2007-04-03 17:44 ` Siddha, Suresh B
  1 sibling, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 17:35 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 10:30:47AM -0700, Ulrich Drepper wrote:
> Andi Kleen wrote:
> > Topology is dependent on the number of CPUs.
>
> Not all of it.

What is not?

> We might add very limited caching (for a few
> seconds) but that's as much as we can go.

Hmm, e.g. in OpenMP you would have another thread that just reads
/proc/cpuinfo in a loop and starts new threads on new CPUs?

That sounds ...... "expensive"

The other use case in glibc I know of is the Opteron-optimized memcpy,
which can use different functions depending on the number of cores.
But having a separate thread regularly rereading cpuinfo for memcpy
also sounds quite crazy.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:35 ` Andi Kleen
@ 2007-04-03 17:45   ` Ulrich Drepper
  2007-04-03 17:58     ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 17:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1198 bytes --]

Andi Kleen wrote:
>>> Topology is dependent on the number of CPUs.
>> Not all of it.
>
> What is not?

Memory banks can exist without a CPU present. The places where you can
plug in memory don't change, and so the memory hierarchy can be
described.

> Hmm, e.g. in OpenMP you would have another thread that just reads
> /proc/cpuinfo in a loop and starts new threads on new CPUs?
>
> That sounds ...... "expensive"

That's the cost of doing business.

There is an inexpensive solution: finally make the vdso concept a bit
more flexible. You could add a vdso call to get the processor count.
The vdso code itself can use a data page mapped in from the kernel.
This page (read-only at userlevel) would contain global information
such as processor count and topology.

But we're getting IMO off topic here. That's a separate and far more
complicated issue. Here we now have the concrete issue that
determining the CPU count is terribly expensive, and there is a simple
proposal to make it faster by keeping /sys/devices/system/cpu/ free
from anything but cpu* directories.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:45 ` Ulrich Drepper
@ 2007-04-03 17:58   ` Andi Kleen
  2007-04-03 18:05     ` Ulrich Drepper
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 17:58 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 10:45:35AM -0700, Ulrich Drepper wrote:
> Andi Kleen wrote:
> >>> Topology is dependent on the number of CPUs.
> >> Not all of it.
> >
> > What is not?
>
> Memory banks can exist without a CPU present. The places where you can
> plug in memory don't change and so the memory hierarchy can be described.

There are systems that support node hotplug, like Altix or the larger
IBM or Unisys x86 systems. Basically they are a bunch of smaller
systems connected with cables running a cache-coherent network
protocol. With that, memory can appear (and disappear, but we don't
handle that yet) unexpectedly.

That said, we might have some idea of that in advance -- it can be
described in ACPI SRAT -- but the trouble is that many quite ordinary
x86 server systems always describe hotplug zones even though it is
unlikely they will ever get any. Trusting that can be quite
inefficient.

> There is an inexpensive solution: finally make the vdso concept a bit
> more flexible. You could add a vdso call to get the processor count.
> The vdso code itself can use a data page mapped in from the kernel.

The ELF aux vector is exactly that already.

> This page (read-only at userlevel) would contain global information such
> as processor count and topology.

You would still need an event notification mechanism, won't you?

> But we're getting IMO off topic here. That's a separate and far more
> complicated issue.

Yes, I agree. Hotplug is best ignored for now.

> Here we now have the concrete issue that determining the CPU count is
> terribly expensive and there is a simple proposal to make it faster by
> keeping /sys/devices/system/cpu/ free from anything but cpu* directories.

The cost will still be large. Accessing sysfs will never be cheap.
For one, anything going through the VFS tends to take a two-,
sometimes three-digit number of locks. If you want it cheap, look for
some other way.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:58 ` Andi Kleen
@ 2007-04-03 18:05   ` Ulrich Drepper
  2007-04-03 18:11     ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 18:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1405 bytes --]

Andi Kleen wrote:
>> There is an inexpensive solution: finally make the vdso concept a bit
>> more flexible. You could add a vdso call to get the processor count.
>> The vdso code itself can use a data page mapped in from the kernel.
>
> The ELF aux vector is exactly that already.

No. The aux vector cannot be changed after the process is started.
The memory belongs to the process and not the kernel. It must be
possible at any time to get the correct information even if the system
changed.

>> This page (read-only at userlevel) would contain global information such
>> as processor count and topology.
>
> You would still need an event notification mechanism, won't you?

No, why? The vdso call would be so inexpensive (just a simple function
call) that it can be done whenever a topology-based decision has to be
made. Use cookies to determine whether anything has changed since the
last call, etc.

> The cost will be still large. Accessing sysfs will be never cheap.
> For once anything going through the VFS tens to take a two sometimes
> three digit number of locks.

That stat solution actually ain't that bad. It takes ~7400 cycles on
my machine.

> If you want it cheap look for some other way.

Well, who's brave enough to submit sys_sysconf() again?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 18:05 ` Ulrich Drepper
@ 2007-04-03 18:11   ` Andi Kleen
  2007-04-03 18:21     ` Ulrich Drepper
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2007-04-03 18:11 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 11:05:49AM -0700, Ulrich Drepper wrote:
> Andi Kleen wrote:
> >> There is an inexpensive solution: finally make the vdso concept a bit
> >> more flexible. You could add a vdso call to get the processor count.
> >> The vdso code itself can use a data page mapped in from the kernel.
> >
> > The ELF aux vector is exactly that already.
>
> No. The aux vector cannot be changed after the process is started. The
> memory belongs to the process and not the kernel. It must be possible at
> any time to get the correct information even if the system changed.

That's probably debatable, but ok. I would be opposed to adding
another page per process, at least because the per-process memory
footprint in Linux is imho already too large.

> No, why? The vdso call would be so inexpensive (just a simple function
> call) that it can be done whenever a topology-based decision has to be
> made. Use cookies to determine whether anything has changed since
> the last call etc.

But how would that mix with the OpenMP use case where you have thread
pools that normally don't make decisions after startup, but just stay
around? I think for those you would need events of some sort to start
or remove threads as needed. Asking the kernel every time you submit
some work to the threads would probably not fly.

> > If you want it cheap look for some other way.
>
> Well, who's brave enough to submit sys_sysconf() again?

If there's a good use case, fine by me. However, I suspect it's either
"slow is ok" or "want it very fast", where even a syscall would hurt.

-Andi

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 18:11 ` Andi Kleen
@ 2007-04-03 18:21   ` Ulrich Drepper
  0 siblings, 0 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 18:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2049 bytes --]

Andi Kleen wrote:
> I would be opposed for adding another page per process at least
> because the per process memory footprint in Linux is imho already
> too large.

That's a single page shared by all threads on the system. Or make
this a page per NUMA node, and if the number of threads is larger than
the share counter for the page, create a few more. But in general the
extra overhead would be minimal from the memory-consumption POV.

> But how would that mix with the OpenMP use case where you have
> thread pools that normally don't make decisions afer startup, but
> just stay around?

There is a difference between having threads in the thread pool and
actually using them. For every #omp loop the number of processors is
checked again, and this is the number of threads from the pool which
is used.

> I think for those you would need events of some sort
> to start or remove threads as needed.

We need no events if determining the number of processors is cheap.
There is really no reason why it shouldn't be. Restarting all the
threads is not the cheapest operation, so a single syscall
(sys_sysconf, etc.) does not increase the cost a lot.

> If there's a good use case fine for me. However I suspect it's
> either "slow is ok" or "want it very fast" where even a syscall
> would hurt.

Ideally, as I said, an optimized vdso call is best. But a syscall is
OK. The nice thing about the vdso is that for now one could simply
implement it using a syscall and in the future add optimizations to
avoid the kernel entry if possible.

A single syscall is two orders of magnitude better than the best
solution available today. This is my main concern right now.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:30 ` Ulrich Drepper
  2007-04-03 17:35   ` Andi Kleen
@ 2007-04-03 17:44 ` Siddha, Suresh B
  2007-04-03 17:59   ` Ulrich Drepper
  2007-04-03 19:55   ` Ulrich Drepper
  1 sibling, 2 replies; 56+ messages in thread
From: Siddha, Suresh B @ 2007-04-03 17:44 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 10:30:47AM -0700, Ulrich Drepper wrote:
> Reading /proc/cpuinfo or /sys/devices/system/cpu reflects the
> current CPU count.

Not all of the cpu* directories in /sys/devices/system/cpu may be
online.

thanks,
suresh

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:44 ` Siddha, Suresh B
@ 2007-04-03 17:59   ` Ulrich Drepper
  2007-04-03 19:40     ` Jakub Jelinek
  2007-04-03 20:13     ` Ingo Oeser
  1 sibling, 2 replies; 56+ messages in thread
From: Ulrich Drepper @ 2007-04-03 17:59 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Andi Kleen, Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 778 bytes --]

Siddha, Suresh B wrote:
> Not all of the cpu* directories in /sys/devices/system/cpu may be
> online.

Brilliant. You people really know how to create user interfaces. So
now in any case per CPU another stat/access syscall is needed to check
the 'online' pseudo-file? With this the readdir solution is only
marginally faster than parsing /proc/cpuinfo, which means it's
unacceptably slow.

I cannot believe all these big system people are allowed to screw
everybody else up with their nonsense.

So, anybody else have a proposal? This is a pressing issue and cannot
wait until someday in the distant future when NUMA topology
information is easily and speedily accessible.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:59 ` Ulrich Drepper
@ 2007-04-03 19:40   ` Jakub Jelinek
  2007-04-03 20:13 ` Ingo Oeser
  1 sibling, 0 replies; 56+ messages in thread
From: Jakub Jelinek @ 2007-04-03 19:40 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Siddha, Suresh B, Andi Kleen, Linux Kernel, Andrew Morton

On Tue, Apr 03, 2007 at 10:59:53AM -0700, Ulrich Drepper wrote:
> Siddha, Suresh B wrote:
> > Not all of the cpu* directories in /sys/devices/system/cpu may be
> > online.
>
> Brilliant. You people really know how to create user interfaces. So
> now in any case per CPU another stat/access syscall is needed to check
> the 'online' pseudo-file? With this the readdir solution is only
> marginally faster than parsing /proc/cpuinfo which means, it's
> unacceptably slow.

Note that glibc actually parses /proc/stat in preference to
/proc/cpuinfo ATM, because /proc/stat is at least uniform, while
parsing /proc/cpuinfo needs a special parser for each architecture.
And /proc/stat reading is even slower than /proc/cpuinfo: on x86_64,
reading/parsing /proc/stat takes about 450usec, while e.g. stat64 on
/sys/devices/system/cpu is just 2.5usec.

But if that can't be trusted as the number of online CPUs, can
somebody please add a short file to proc or sysfs which will contain
the number of online and the number of configured CPUs?

See e.g. http://openmp.org/pipermail/omp/2007/000714.html
where the first time after the second g++ invocation is with
omp_set_dynamic (1) and ought to be about as fast as the
omp_set_dynamic (0) case with the same number of threads, but it is
far slower due to slow sysconf (_SC_NPROCESSORS_ONLN).

Jakub

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
  2007-04-03 17:59 ` Ulrich Drepper
  2007-04-03 19:40   ` Jakub Jelinek
@ 2007-04-03 20:13 ` Ingo Oeser
  2007-04-03 23:38   ` J.A. Magallón
  1 sibling, 1 reply; 56+ messages in thread
From: Ingo Oeser @ 2007-04-03 20:13 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Siddha, Suresh B, Andi Kleen, Linux Kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1370 bytes --]

Hi Ulrich,

On Tuesday 03 April 2007, Ulrich Drepper wrote:
> So, anybody else has a proposal? This is a pressing issue and cannot
> wait until someday in the distant future NUMA topology information is
> easily and speedily accessible.

Since for now you just need a fast and dirty hack, which will be
replaced with better interfaces, I suggest creating a directory with
some files in it. These should just contain what you need to handle
your most pressing cases.

I propose /sys/devices/system/topology_counters/ for that. It can
contain "online_cpu", "probed_cpu", "max_cpu" and maybe the same for
nodes -- each a simple file with an integer value.

Since sysfs-attribute files are pollable (if the owner notifies sysfs
on changes), you also have the notification system you need (select,
poll, epoll etc.).

If you promise to just keep the slow code around, then one day when
the shiny NUMA topology stuff is ready, this directory can be
completely removed and glibc (plus all its users) keeps working. It
will then even work better with a new glibc version which supports
the shiny new NUMA topology stuff.

The kernel can create these counters quite easily, since most of them
are the hamming weight (or population count) of some bitmaps.

Does this sound like a proper hacky solution? :-)

Regards

Ingo Oeser

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 20:13 ` Ingo Oeser @ 2007-04-03 23:38 ` J.A. Magallón 0 siblings, 0 replies; 56+ messages in thread From: J.A. Magallón @ 2007-04-03 23:38 UTC (permalink / raw) To: Ingo Oeser Cc: Ulrich Drepper, Siddha, Suresh B, Andi Kleen, Linux Kernel, Andrew Morton On Tue, 3 Apr 2007 22:13:07 +0200, Ingo Oeser <ioe-lkml@rameria.de> wrote: > Hi Ulrich, > > On Tuesday 03 April 2007, Ulrich Drepper wrote: > > So, anybody else has a proposal? This is a pressing issue and cannot > > wait until someday in the distant future NUMA topology information is > > easily and speedily accessible. > > Since for now you just need a fast and dirty hack, which will be replaced > with better interfaces, I suggest creating a directory with some files in it. > These should just contain, what you need to handle your most pressing cases. > > I propose /sys/devices/system/topology_counters/ for that. > These can contain "online_cpu", "proped_cpu", "max_cpu" > and maybe the same for nodes. All that as a simple file with an integer > value. > > Since sysfs-attribute files are pollable (if the owners notifies sysfs > on changes), you also have the notification system you need > (select, poll, epoll etc.). > > If you promise to just keep the slow code around, than one day when the shiny > NUMA topology stuff is ready, this directory can be completely removed and > glibc (plus all their users) keeps working. It will then even work better with a > new glibc version, which supports the shiny new NUMA topology stuff. > > The kernel can create these counters quiete easy, since most of them are > the hamming weight (or population count) of some bitmaps. > > Does this sound like a proper hacky solution? :-) > Just a point of view from someone who has to parse /proc/cpuinfo. That sort of file tree thing is useful to work from the command line but its a kick in the a** to use from a program. 
This forces you to re-parse the tree each time you have to get the info (open, read, atoi, close...) to fill your internal variables. I don't know if it's possible, but I would like something like: __packt_it_tight_please struct cpumap_t { u16 ncpus; u16 ncpus_onln; u16 ncpus_inmyset; // for procsets // Here possibly more info about topology, pack-core-thread structure... // in simple arrays... }; struct cpumap_t *cpumap = mmap("/proc/sys/hw/cpumap",sizeof(struct cpumap_t)); for (...cpumap->ncpus_inmyset ....) // As I said, I don't know if it's possible. I vaguely remember some comments against binary info in /proc... It could even be simplified if you realize some things: - Usually people don't worry about whether the cpus are all the online ones or whether they are running in a proc set. They just want to know how many they can use. - They don't care if they are hyper-threaded, cores or independent processors. To adjust processing for hyper-threaded cpus, one needs to tie processes to processors, and you need to be root for that. Really, anything dependent on topology is not usable for normal programs, because you need to be root to control that. So topology is not so important. Some (probably stupid) ideas... -- J.A. Magallon <jamagallon()ono!com> \ Software is like sex: \ It's better when it's free Mandriva Linux release 2007.1 (Cooker) for i586 Linux 2.6.20-jam08 (gcc 4.1.2 20070302 (prerelease) (4.1.2-1mdv2007.1)) #1 SMP PREEMPT ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 17:44 ` Siddha, Suresh B 2007-04-03 17:59 ` Ulrich Drepper @ 2007-04-03 19:55 ` Ulrich Drepper 2007-04-03 20:13 ` Siddha, Suresh B 2007-04-03 20:20 ` Nathan Lynch 1 sibling, 2 replies; 56+ messages in thread From: Ulrich Drepper @ 2007-04-03 19:55 UTC (permalink / raw) To: Siddha, Suresh B; +Cc: Linux Kernel, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 652 bytes --] Siddha, Suresh B wrote: > Not all of the cpu* directories in /sys/devices/system/cpu may be > online. Apparently this information isn't needed. It's very easy to verify: $ ls /sys/devices/system/cpu/*/online /sys/devices/system/cpu/cpu1/online /sys/devices/system/cpu/cpu2/online /sys/devices/system/cpu/cpu3/online This is a quad core machine and cpu0 doesn't have the 'online' file (2.6.19 kernel). So, if nobody noticed this, it's not needed and we can just remove CPUs from /sys/devices/system/cpu when they are brought offline, right? -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 19:55 ` Ulrich Drepper @ 2007-04-03 20:13 ` Siddha, Suresh B 2007-04-03 20:19 ` Ulrich Drepper 2007-04-03 20:20 ` Nathan Lynch 1 sibling, 1 reply; 56+ messages in thread From: Siddha, Suresh B @ 2007-04-03 20:13 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Siddha, Suresh B, Linux Kernel, Andrew Morton On Tue, Apr 03, 2007 at 12:55:22PM -0700, Ulrich Drepper wrote: > Siddha, Suresh B wrote: > > Not all of the cpu* directories in /sys/devices/system/cpu may be > > online. > > Apparently this information isn't needed. It's very easy to verify: > > $ ls /sys/devices/system/cpu/*/online > /sys/devices/system/cpu/cpu1/online /sys/devices/system/cpu/cpu2/online > /sys/devices/system/cpu/cpu3/online > > This is a quad core machine and cpu0 doesn't have the 'online' file > (2.6.19 kernel). I think that is expected and intentional, as the current cpu hotplug code doesn't support offlining cpu0. > So, if nobody noticed this, it's not needed and we can > just remove CPUs from /sys/devices/system/cpu when they are brought > offline, right? No. Logical cpu hotplug uses these interfaces to make a cpu go offline and online. thanks, suresh ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 20:13 ` Siddha, Suresh B @ 2007-04-03 20:19 ` Ulrich Drepper 2007-04-03 20:32 ` Eric Dumazet 0 siblings, 1 reply; 56+ messages in thread From: Ulrich Drepper @ 2007-04-03 20:19 UTC (permalink / raw) To: Siddha, Suresh B; +Cc: Linux Kernel, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 426 bytes --] Siddha, Suresh B wrote: > No. Logical cpu hotplug uses these interfaces to make a cpu go offline > and online. You missed my sarcasm; email is bad at conveying it. The point is that nobody cares enough about that hotplug nonsense to have noticed the bug. And still this nonsense prevents real problems from being addressed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 20:19 ` Ulrich Drepper @ 2007-04-03 20:32 ` Eric Dumazet 0 siblings, 0 replies; 56+ messages in thread From: Eric Dumazet @ 2007-04-03 20:32 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Siddha, Suresh B, Linux Kernel, Andrew Morton Ulrich Drepper a écrit : > Siddha, Suresh B wrote: >> No. Logical cpu hotplug uses these interfaces to make a cpu go offline >> and online. > > You missed my sarcasm; email is bad at conveying it. The point is that > nobody cares enough about that hotplug nonsense to have noticed the bug. > And still this nonsense prevents real problems from being addressed. > Please don't focus on /sys as your holy grail. 1) AFAIK /sys/devices/system/cpu was not designed to meet glibc needs. 2) Many production machines don't mount /sys at all $ uname -r 2.6.20 $ ls -al /sys/devices/system/cpu ls: /sys/devices/system/cpu: No such file or directory $ grep processor /proc/cpuinfo processor : 0 processor : 1 ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 19:55 ` Ulrich Drepper 2007-04-03 20:13 ` Siddha, Suresh B @ 2007-04-03 20:20 ` Nathan Lynch 1 sibling, 0 replies; 56+ messages in thread From: Nathan Lynch @ 2007-04-03 20:20 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Siddha, Suresh B, Linux Kernel, Andrew Morton Ulrich Drepper wrote: > Siddha, Suresh B wrote: > > Not all of the cpu* directories in /sys/devices/system/cpu may be > > online. > > Apparently this information isn't needed. It's very easy to verify: > > $ ls /sys/devices/system/cpu/*/online > /sys/devices/system/cpu/cpu1/online /sys/devices/system/cpu/cpu2/online > /sys/devices/system/cpu/cpu3/online > > This is a quad core machine and cpu0 doesn't have the 'online' file > (2.6.19 kernel). So, if nobody noticed this, it's not needed and we can > just remove CPUs from /sys/devices/system/cpu when they are brought > offline, right? No... the online sysfs files are used to show and change cpus' online/offline state. You wouldn't be able to bring an offlined cpu back online again. cpu0 doesn't have an online file on machines which don't support offlining of the boot cpu. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 16:54 getting processor numbers Ulrich Drepper ` (2 preceding siblings ...) 2007-04-03 18:11 ` Andi Kleen @ 2007-04-03 19:15 ` Davide Libenzi 2007-04-03 19:32 ` Ulrich Drepper 2007-04-03 20:16 ` Andrew Morton 4 siblings, 1 reply; 56+ messages in thread From: Davide Libenzi @ 2007-04-03 19:15 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Linux Kernel, Andrew Morton On Tue, 3 Apr 2007, Ulrich Drepper wrote: > More and more code depends on knowing the number of processors in the > system to efficiently scale the code. E.g., in OpenMP it is used by > default to determine how many threads to create. Creating more threads > than there are processors/cores doesn't make sense. > > glibc for a long time provides functionality to retrieve the number > through sysconf() and this is what fortunately most programs use. The > problem is that we are currently using /proc/cpuinfo since this is all > there was available at that time. Creating /proc/cpuinfo takes the > kernel quite a long time, unfortunately (I think Jakub said it is mainly > the interrupt information). > > The alternative today is to use /sys/devices/system/cpu and count the > number of cpu* directories in it. This is somewhat faster. But there > would be another possibility: simply stat /sys/devices/system/cpu and > use st_nlink - 2. > > This last step unfortunately it made impossible by recent changes: > > http://article.gmane.org/gmane.linux.kernel/413178 > > I would like to propose changing that patch, move the sched_* > pseudo-files in some other directly and permanently ban putting any new > file into /sys/devices/system/cpu. > > To get some numbers, you can try > > http://people.redhat.com/drepper/nproc-timing.c > > The numbers I see on x86-64: > > cpuinfo 10145810 cycles for 100 accesses > readdir /sys 3113870 cycles for 100 accesses > stat /sys 741070 cycles for 100 accesses It sucks when seen from a micro-bench POV, but does it really matter overall? 
The vast majority of software usually calls sysconf(_SC_NPROCESSORS_*) with very little frequency (mostly once at initialization time) anyway. That's what 50us / call? - Davide ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 19:15 ` Davide Libenzi @ 2007-04-03 19:32 ` Ulrich Drepper 2007-04-04 0:31 ` H. Peter Anvin 0 siblings, 1 reply; 56+ messages in thread From: Ulrich Drepper @ 2007-04-03 19:32 UTC (permalink / raw) To: Davide Libenzi; +Cc: Linux Kernel, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 962 bytes --] Davide Libenzi wrote: > It sucks when seen from a micro-bench POV, but does it really matter > overall? The vast majority of software usually calls > sysconf(_SC_NPROCESSORS_*) with very little frequency (mostly once at > initialization time) anyway. That's what 50us / call? This is not today's situation. Yes, 10 years ago when I added the support to glibc it wasn't much of a problem. But times change. As I said before in this thread, OpenMP by default scales the number of threads used for parallel loops depending on the number of available processors/cores and therefore the number must be retrieved every time (with perhaps minimal caching of a few secs, but this requires gettimeofday calls...). All of a sudden this is not a micro benchmark anymore. It's a real issue which we only became aware of because it is noticeable in real life. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 19:32 ` Ulrich Drepper @ 2007-04-04 0:31 ` H. Peter Anvin 2007-04-04 0:35 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 56+ messages in thread From: H. Peter Anvin @ 2007-04-04 0:31 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Davide Libenzi, Linux Kernel, Andrew Morton Ulrich Drepper wrote: > Davide Libenzi wrote: >> It sucks when seen from a micro-bench POV, but does it really matter >> overall? The vast majority of software usually calls >> sysconf(_SC_NPROCESSORS_*) with very little frequency (mostly once at >> initialization time) anyway. That's what 50us / call? > > This is not today's situation. Yes, 10 years ago when I added the > support to glibc it wasn't much of a problem. But times change. As I > said before in this thread, OpenMP by default scales the number of > threads used for parallel loops depending on the number of available > processors/cores and therefore the number must be retrieved every time > (with perhaps minimal caching of a few secs, but this requires > gettimeofday calls...). All of a sudden this is not a micro benchmark > anymore. It's a real issue which we only became aware of because it is > noticeable in real life. Sounds like it would need a device which can be waited upon for changes. -hpa ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 0:31 ` H. Peter Anvin @ 2007-04-04 0:35 ` Jeremy Fitzhardinge 2007-04-04 0:38 ` H. Peter Anvin 0 siblings, 1 reply; 56+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-04 0:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton H. Peter Anvin wrote: > Sounds like it would need a device which can be waited upon for changes. A vdso-like shared page could have a futex in it. J ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 0:35 ` Jeremy Fitzhardinge @ 2007-04-04 0:38 ` H. Peter Anvin 2007-04-04 5:09 ` Eric Dumazet 0 siblings, 1 reply; 56+ messages in thread From: H. Peter Anvin @ 2007-04-04 0:38 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton Jeremy Fitzhardinge wrote: > H. Peter Anvin wrote: >> Sounds like it would need a device which can be waited upon for changes. > > A vdso-like shared page could have a futex in it. Yes, but a futex couldn't be waited upon with a bunch of other things as part of a poll or a select. The cost of reading the information is minimal. -hpa ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 0:38 ` H. Peter Anvin @ 2007-04-04 5:09 ` Eric Dumazet 2007-04-04 5:16 ` H. Peter Anvin 0 siblings, 1 reply; 56+ messages in thread From: Eric Dumazet @ 2007-04-04 5:09 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen H. Peter Anvin a écrit : > Jeremy Fitzhardinge wrote: >> H. Peter Anvin wrote: >>> Sounds like it would need a device which can be waited upon for changes. >> >> A vdso-like shared page could have a futex in it. > > Yes, but a futex couldn't be waited upon with a bunch of other things as > part of a poll or a select. The cost of reading the information is > minimal. > There is one thing that always worried me. Intel & AMD manuals make it clear that mixing data and program in the same page is bad for performance. In particular, x86_64 vsyscall puts jiffies and other vsyscall_gtod_data_t right in the middle of the code. That is certainly not wise. A sane implementation should probably use two pages, one for code, one for data. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 5:09 ` Eric Dumazet @ 2007-04-04 5:16 ` H. Peter Anvin 2007-04-04 5:22 ` Jeremy Fitzhardinge 2007-04-04 5:29 ` Eric Dumazet 1 sibling, 2 replies; 56+ messages in thread From: H. Peter Anvin @ 2007-04-04 5:16 UTC (permalink / raw) To: Eric Dumazet Cc: Jeremy Fitzhardinge, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen Eric Dumazet wrote: > > There is one thing that always worried me. > > Intel & AMD manuals make it clear that mixing data and program in the same > page is bad for performance. > > In particular, x86_64 vsyscall puts jiffies and other > vsyscall_gtod_data_t right in the middle of the code. That is certainly not > wise. > > A sane implementation should probably use two pages, one for code, one > for data. > Mutable data should be separated from code. I think any current CPU will do fine as long as they are in separate 128-byte chunks, but they need at least that much separation. Readonly data does not need to be separated from code. -hpa ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 5:16 ` H. Peter Anvin @ 2007-04-04 5:22 ` Jeremy Fitzhardinge 2007-04-04 5:40 ` H. Peter Anvin 2007-04-04 5:29 ` Eric Dumazet 1 sibling, 1 reply; 56+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-04 5:22 UTC (permalink / raw) To: H. Peter Anvin Cc: Eric Dumazet, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen H. Peter Anvin wrote: > Mutable data should be separated from code. I think any current CPU > will do fine as long as they are in separate 128-byte chunks, but they > need at least that much separation. P4 manual says that if one processor modifies data within 2k of another processor executing code, it will trash the entire trace cache. J ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 5:22 ` Jeremy Fitzhardinge @ 2007-04-04 5:40 ` H. Peter Anvin 2007-04-04 5:46 ` Eric Dumazet 0 siblings, 1 reply; 56+ messages in thread From: H. Peter Anvin @ 2007-04-04 5:40 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Eric Dumazet, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen Jeremy Fitzhardinge wrote: > H. Peter Anvin wrote: >> Mutable data should be separated from code. I think any current CPU >> will do fine as long as they are in separate 128-byte chunks, but they >> need at least that much separation. > P4 manual says that if one processor modifies data within 2k of another > processor executing code, it will trash the entire trace cache. Yuck. Didn't realize the P4 was that sensitive. OK, so at the least we need a half-page of separation. -hpa ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 5:40 ` H. Peter Anvin @ 2007-04-04 5:46 ` Eric Dumazet 0 siblings, 0 replies; 56+ messages in thread From: Eric Dumazet @ 2007-04-04 5:46 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen H. Peter Anvin a écrit : > Jeremy Fitzhardinge wrote: >> H. Peter Anvin wrote: >>> Mutable data should be separated from code. I think any current CPU >>> will do fine as long as they are in separate 128-byte chunks, but they >>> need at least that much separation. >> P4 manual says that if one processor modifies data within 2k of another >> processor executing code, it will trash the entire trace cache. > > Yuck. Didn't realize the P4 was that sensitive. OK, so at the least we > need a half-page of separation. Yes but vsyscall API currently defines 4 entry points : vsyscall0 (vgettimeofday) = ADDR vsyscall1 (vtime) = ADDR+1024 vsyscall2 (vgetcpu) = ADDR+2048 vsyscall3 (vxxxxx) = ADDR+3072 ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-04 5:16 ` H. Peter Anvin 2007-04-04 5:22 ` Jeremy Fitzhardinge @ 2007-04-04 5:29 ` Eric Dumazet 1 sibling, 0 replies; 56+ messages in thread From: Eric Dumazet @ 2007-04-04 5:29 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Ulrich Drepper, Davide Libenzi, Linux Kernel, Andrew Morton, Andi Kleen H. Peter Anvin a écrit : > Eric Dumazet wrote: >> >> There is one thing that always worried me. >> >> Intel & AMD manuals make it clear that mixing data and program in the >> same page is bad for performance. >> >> In particular, x86_64 vsyscall puts jiffies and other >> vsyscall_gtod_data_t right in the middle of the code. That is certainly not >> wise. >> >> A sane implementation should probably use two pages, one for code, one >> for data. >> > > Mutable data should be separated from code. I think any current CPU > will do fine as long as they are in separate 128-byte chunks, but they > need at least that much separation. > > Readonly data does not need to be separated from code. > Yes... jiffies & vsyscall_gtod_data_t are written HZ times per second. Not really readonly, I'm afraid. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 16:54 getting processor numbers Ulrich Drepper ` (3 preceding siblings ...) 2007-04-03 19:15 ` Davide Libenzi @ 2007-04-03 20:16 ` Andrew Morton [not found] ` <4612BB89.8040102@redhat.com> 2007-04-04 2:04 ` Paul Jackson 4 siblings, 2 replies; 56+ messages in thread From: Andrew Morton @ 2007-04-03 20:16 UTC (permalink / raw) To: Ulrich Drepper; +Cc: Linux Kernel On Tue, 03 Apr 2007 09:54:46 -0700 Ulrich Drepper <drepper@redhat.com> wrote: > More and more code depends on knowing the number of processors in the > system to efficiently scale the code. E.g., in OpenMP it is used by > default to determine how many threads to create. Creating more threads > than there are processors/cores doesn't make sense. but... It would be a mistake for an application to assume that it is allowed to _use_ all the present CPUs. People can and do run applications within cpusets, and under sched_setaffinity(). So I'd have thought that in general an application should be querying its present affinity mask - something like sched_getaffinity()? That fixes the CPU hotplug issues too, of course. But we discussed this all a couple years back and it was decided that sched_getaffinity() was unsuitable. I remember at the time not really understanding why? ^ permalink raw reply [flat|nested] 56+ messages in thread
[parent not found: <4612BB89.8040102@redhat.com>]
[parent not found: <20070403141348.9bcdb13e.akpm@linux-foundation.org>]
* Re: getting processor numbers [not found] ` <20070403141348.9bcdb13e.akpm@linux-foundation.org> @ 2007-04-03 22:13 ` Ulrich Drepper 2007-04-03 22:48 ` Andrew Morton 0 siblings, 1 reply; 56+ messages in thread From: Ulrich Drepper @ 2007-04-03 22:13 UTC (permalink / raw) To: Andrew Morton, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 1905 bytes --] Andrew Morton wrote: > Did we mean to go off-list? Oops, no, pressed the wrong button. >> Andrew Morton wrote: >>> So I'd have thought that in general an application should be querying its >>> present affinity mask - something like sched_getaffinity()? That fixes the >>> CPU hotplug issues too, of course. >> Does it really? >> >> My recollection is that the affinity masks of running processes are not >> updated on hotplugging. Is this addressed? > > ah, yes, you're correct. > > Inside a cpuset: > > sched_setaffinity() is constrained to those CPUs which are in the > cpuset. > > If a cpu is on/offlined we update each cpuset's cpu mask appropriately > but we do not update all the tasks presently running in the cpuset. > > Outside a cpuset: > > sched_setaffinity() is constrained to all possible cpus > > We don't update each task's cpus_allowed when a CPU is removed. > > > I think we trivially _could_ update each task's cpus_allowed mask when a > CPU is removed, actually. I think it has to be done. But that's not so trivial. What happens if all the CPUs a process was supposed to be runnable on vanish? Shouldn't, if no affinity mask is defined, new processors be added? I agree that if the process has a defined affinity mask no new processors should be added _automatically_. >> If yes, sched_getaffinity is a solution until the NUMA topology >> framework can provide something better. Even without a popcnt >> instruction in the CPU (64-bit albeit) it's twice as fast as the >> stat() method proposed. > > I'm surprised - I'd have expected sched_getaffinity() to be vastly quicker > than doing filesystem operations. 
You mean because it's only a factor of two? Well, it's not once you count the whole overhead. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 22:13 ` Ulrich Drepper @ 2007-04-03 22:48 ` Andrew Morton 2007-04-03 23:00 ` Ulrich Drepper 2007-04-04 2:52 ` Paul Jackson 0 siblings, 2 replies; 56+ messages in thread From: Andrew Morton @ 2007-04-03 22:48 UTC (permalink / raw) To: Ulrich Drepper Cc: Linux Kernel, Gautham R Shenoy, Dipankar Sarma, Paul Jackson On Tue, 03 Apr 2007 15:13:09 -0700 Ulrich Drepper <drepper@redhat.com> wrote: > Andrew Morton wrote: > > Did we mean to go off-list? > > Oops, no, pressed the wrong button. > > >> Andrew Morton wrote: > >>> So I'd have thought that in general an application should be querying its > >>> present affinity mask - something like sched_getaffinity()? That fixes the > >>> CPU hotplug issues too, of course. > >> Does it really? > >> > >> My recollection is that the affinity masks of running processes are not > >> updated on hotplugging. Is this addressed? > > > > ah, yes, you're correct. > > > > Inside a cpuset: > > > > sched_setaffinity() is constrained to those CPUs which are in the > > cpuset. > > > > If a cpu is on/offlined we update each cpuset's cpu mask appropriately > > but we do not update all the tasks presently running in the cpuset. > > > > Outside a cpuset: > > > > sched_setaffinity() is constrained to all possible cpus > > > > We don't update each task's cpus_allowed when a CPU is removed. > > > > > > I think we trivially _could_ update each task's cpus_allowed mask when a > > CPU is removed, actually. > > I think it has to be done. But that's not so trivial. What happens if > all the CPUs a process was supposed to be runnable on vanish? > Shouldn't, if no affinity mask is defined, new processors be added? I > agree that if the process has a defined affinity mask no new processors > should be added _automatically_. > Yes, some policy decision needs to be made there. 
But whatever we decide to do, the implementation will be relatively straightforward, because hot-unplug uses stop_machine_run() and later, we hope, will use the process freezer. This setting of the whole machine into a known state means (I think) that we can avoid a whole lot of fuss which happens when affinity is altered. Anyway. It's not really clear who maintains CPU hotplug nowadays. <adds a few cc's>. But yes, I do think we should do <something sane> with process affinity when CPU hot[un]plug happens. Now it could be argued that the current behaviour is that sane thing: we allow the process to "pin" itself to not-present CPUs and just handle it in the CPU scheduler. Paul, could you please describe what cpusets' policy is in the presence of CPU addition and removal? > > >> If yes, sched_getaffinity is a solution until the NUMA topology > >> framework can provide something better. Even without a popcnt > >> instruction in the CPU (64-bit albeit) it's twice as fast as the > >> stat() method proposed. > > > > I'm surprised - I'd have expected sched_getaffinity() to be vastly quicker > > than doing filesystem operations. > > You mean because it's only a factor of two? Well, it's not once you > count the whole overhead. Is it kernel overhead, or userspace? The overhead of counting the bits? Because sched_getaffinity() could be easily sped up in the case where it is operating on the current process. Anyway, where do we stand? Assuming we can address the CPU hotplug issues, does sched_getaffinity() look like it will be suitable? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 22:48 ` Andrew Morton @ 2007-04-03 23:00 ` Ulrich Drepper 2007-04-03 23:23 ` Andrew Morton ` (2 more replies) 2007-04-04 2:52 ` Paul Jackson 1 sibling, 3 replies; 56+ messages in thread From: Ulrich Drepper @ 2007-04-03 23:00 UTC (permalink / raw) To: Andrew Morton Cc: Linux Kernel, Gautham R Shenoy, Dipankar Sarma, Paul Jackson [-- Attachment #1: Type: text/plain, Size: 1453 bytes --] Andrew Morton wrote: > Now it could be argued that the current behaviour is that sane thing: we > allow the process to "pin" itself to not-present CPUs and just handle it in > the CPU scheduler. As a stop-gap solution Jakub will likely implement the sched_getaffinity hack. So, it would really be best to get the masks updated. But all this of course does not solve the issue sysconf() has. In sysconf we cannot use sched_getaffinity since all the system's CPUs must be reported. > Is it kernel overhead, or userspace? The overhead of counting the bits? The overhead I meant is userland. > Because sched_getaffinity() could be easily sped up in the case where > it is operating on the current process. If there is a possibility to treat this case specially and make it faster, please do so. It would be best to allow pid==0 as a special case so that callers don't have to find out the TID (which they shouldn't have to know). > Anyway, where do we stand? Assuming we can address the CPU hotplug issues, > does sched_getaffinity() look like it will be suitable? It's only usable for the special case in the OpenMP code where the number of threads is used to determine the number of worker threads. For sysconf() we still need better support. Maybe now somebody will step up and say they need faster sysconf as well. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers 2007-04-03 23:00 ` Ulrich Drepper @ 2007-04-03 23:23 ` Andrew Morton 2007-04-03 23:54 ` Ulrich Drepper ` (2 more replies) 2007-04-04 2:58 ` Paul Jackson 2007-04-04 3:04 ` Paul Jackson 2 siblings, 3 replies; 56+ messages in thread From: Andrew Morton @ 2007-04-03 23:23 UTC (permalink / raw) To: Ulrich Drepper Cc: Linux Kernel, Gautham R Shenoy, Dipankar Sarma, Paul Jackson, Ingo Molnar, Oleg Nesterov On Tue, 03 Apr 2007 16:00:50 -0700 Ulrich Drepper <drepper@redhat.com> wrote: > Andrew Morton wrote: > > Now it could be argued that the current behaviour is that sane thing: we > > allow the process to "pin" itself to not-present CPUs and just handle it in > > the CPU scheduler. > > As a stop-gap solution Jakub will likely implement the sched_getaffinity > hack. So, it would really be best to get the masks updated. > > > But all this of course does not solve the issue sysconf() has. In > sysconf we cannot use sched_getaffinity since all the system's CPUs must > be reported. OK. This is exceptionally gruesome, but one could run sched_getaffinity() against pid 1 (init). Which will break nicely in the OS-virtualised future when the system has multiple pid-1-inits running in containers... > > > Is it kernel overhead, or userspace? The overhead of counting the bits? > > The overhead I meant is userland. > OK. Your cost of counting those bits is proportional to CONFIG_NR_CPUS. It's a bit sad that sys_sched_getaffinity() returns sizeof(cpumask_t), because that means that userspace must handle 256 or whatever CPUs on a machine which only has two CPUs. Does anyone see a reason why sys_sched_getaffinity() cannot be altered to return maximum-possible-cpu-id-on-this-machine? That way, your hweight operation will be much faster on sane-sized machines. 
> > If there is possibility to treat this case special and make it faster, > please do so. It would be best to allow pid==0 as a special case so > that callers don't have to find out the TID (which they shouldn't have > to know). > OK. Does anyone see a reason why we cannot do this? --- a/kernel/sched.c~sched_getaffinity-speedup +++ a/kernel/sched.c @@ -4381,8 +4381,12 @@ long sched_getaffinity(pid_t pid, cpumas struct task_struct *p; int retval; - lock_cpu_hotplug(); - read_lock(&tasklist_lock); + if (pid) { + lock_cpu_hotplug(); + read_lock(&tasklist_lock); + } else { + preempt_disable(); /* Prevent CPU hotplugging */ + } retval = -ESRCH; p = find_process_by_pid(pid); @@ -4396,12 +4400,13 @@ long sched_getaffinity(pid_t pid, cpumas cpus_and(*mask, p->cpus_allowed, cpu_online_map); out_unlock: - read_unlock(&tasklist_lock); - unlock_cpu_hotplug(); - if (retval) - return retval; - - return 0; + if (pid) { + read_unlock(&tasklist_lock); + unlock_cpu_hotplug(); + } else { + preempt_enable(); + } + return retval; } /** _ > > > Anyway, where do we stand? Assuming we can address the CPU hotplug issues, > > does sched_getaffinity() look like it will be suitable? > > It's only usable for the special case on the OpenMP code where the > number of threads is used to determine the number of worker threads. > For sysconf() we still need better support. Maybe now somebody will > step up and say they need faster sysconf as well. I guess we could add a simple sys_get_nr_cpus(). If we want more than that (ie: topology, SMT/MC/NUMA/numa-distance etc) then it gets much more complex and sysfs is more appropriate for that. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: getting processor numbers
From: Ulrich Drepper @ 2007-04-03 23:54 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Kernel, Gautham R Shenoy, Dipankar Sarma, Paul Jackson,
    Ingo Molnar, Oleg Nesterov

Andrew Morton wrote:
> Does anyone see a reason why we cannot do this?

Shouldn't sched_setaffinity get the same treatment for symmetry reasons?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  2:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: drepper, linux-kernel, ego, dipankar, mingo, oleg

Andrew wrote:
> > But all this of course does not solve the issue sysconf() has.  In
> > sysconf we cannot use sched_getaffinity since all the system's CPUs
> > must be reported.
>
> OK.
>
> This is exceptionally gruesome, but one could run sched_getaffinity()
> against pid 1 (init).  Which will break nicely in the OS-virtualised
> future when the system has multiple pid-1-inits running in containers...

That nicely breaks on typical cpuset-managed systems as well, which
frequently put init into a small cpuset (with the classic Unix daemon and
login load), leaving the bulk of the system to be managed by a batch
scheduler.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: getting processor numbers
From: Oleg Nesterov @ 2007-04-04  8:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Ulrich Drepper, Linux Kernel, Gautham R Shenoy, Dipankar Sarma,
    Paul Jackson, Ingo Molnar

On 04/03, Andrew Morton wrote:
>
> On Tue, 03 Apr 2007 16:00:50 -0700
> Ulrich Drepper <drepper@redhat.com> wrote:
>
> > If there is possibility to treat this case special and make it faster,
> > please do so.  It would be best to allow pid==0 as a special case so
> > that callers don't have to find out the TID (which they shouldn't have
> > to know).
>
> OK.
>
> Does anyone see a reason why we cannot do this?
>
> --- a/kernel/sched.c~sched_getaffinity-speedup
> +++ a/kernel/sched.c
> @@ -4381,8 +4381,12 @@ long sched_getaffinity(pid_t pid, cpumas
>  	struct task_struct *p;
>  	int retval;
>
> -	lock_cpu_hotplug();
> -	read_lock(&tasklist_lock);
> +	if (pid) {
> +		lock_cpu_hotplug();
> +		read_lock(&tasklist_lock);
> +	} else {
> +		preempt_disable();	/* Prevent CPU hotplugging */
> +	}

But we don't need tasklist_lock at all, we can use rcu_read_lock/unlock.

Q: don't we need task_rq_lock() to read ->cpus_allowed "atomically"?

UNTESTED.

--- OLD/kernel/sched.c~	2007-04-03 13:05:02.000000000 +0400
+++ OLD/kernel/sched.c	2007-04-04 12:29:04.000000000 +0400
@@ -4433,22 +4433,17 @@ long sched_setaffinity(pid_t pid, cpumas
 	int retval;
 
 	mutex_lock(&sched_hotcpu_mutex);
-	read_lock(&tasklist_lock);
+	rcu_read_lock();
 
 	p = find_process_by_pid(pid);
 	if (!p) {
-		read_unlock(&tasklist_lock);
+		rcu_read_unlock();
 		mutex_unlock(&sched_hotcpu_mutex);
 		return -ESRCH;
 	}
 
-	/*
-	 * It is not safe to call set_cpus_allowed with the
-	 * tasklist_lock held. We will bump the task_struct's
-	 * usage count and then drop tasklist_lock.
-	 */
 	get_task_struct(p);
-	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 
 	retval = -EPERM;
 	if ((current->euid != p->euid) && (current->euid != p->uid) &&
@@ -4523,7 +4518,7 @@ long sched_getaffinity(pid_t pid, cpumas
 	int retval;
 
 	mutex_lock(&sched_hotcpu_mutex);
-	read_lock(&tasklist_lock);
+	rcu_read_lock();
 
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
@@ -4537,7 +4532,7 @@ long sched_getaffinity(pid_t pid, cpumas
 	cpus_and(*mask, p->cpus_allowed, cpu_online_map);
 
 out_unlock:
-	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 	mutex_unlock(&sched_hotcpu_mutex);
 	if (retval)
 		return retval;
* Re: getting processor numbers
From: Ingo Molnar @ 2007-04-04  9:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Ulrich Drepper, Linux Kernel, Gautham R Shenoy,
    Dipankar Sarma, Paul Jackson

* Oleg Nesterov <oleg@tv-sign.ru> wrote:

> But we don't need tasklist_lock at all, we can use
> rcu_read_lock/unlock.  Q: don't we need task_rq_lock() to read
> ->cpus_allowed "atomically"?

right now ->cpus_allowed is protected by tasklist_lock.  We cannot do
RCU here because ->cpus_allowed modifications are not RCUified.

	Ingo
* Re: getting processor numbers
From: Oleg Nesterov @ 2007-04-04  8:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Ulrich Drepper, Linux Kernel, Gautham R Shenoy,
    Dipankar Sarma, Paul Jackson

On 04/04, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@tv-sign.ru> wrote:
>
> > But we don't need tasklist_lock at all, we can use
> > rcu_read_lock/unlock.  Q: don't we need task_rq_lock() to read
> > ->cpus_allowed "atomically"?
>
> right now ->cpus_allowed is protected by tasklist_lock.  We cannot do
> RCU here because ->cpus_allowed modifications are not RCUified.

Is it so?  That was my question.  Afaics, set_cpus_allowed() does
p->cpus_allowed = new_mask under rq->lock, so I don't understand how
tasklist_lock can help.  Could you clarify?

Oleg.
* Re: getting processor numbers
From: Ingo Molnar @ 2007-04-04 10:01 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Ulrich Drepper, Linux Kernel, Gautham R Shenoy,
    Dipankar Sarma, Paul Jackson

* Oleg Nesterov <oleg@tv-sign.ru> wrote:

> On 04/04, Ingo Molnar wrote:
> >
> > * Oleg Nesterov <oleg@tv-sign.ru> wrote:
> >
> > > But we don't need tasklist_lock at all, we can use
> > > rcu_read_lock/unlock.  Q: don't we need task_rq_lock() to read
> > > ->cpus_allowed "atomically"?
> >
> > right now ->cpus_allowed is protected by tasklist_lock.  We cannot do
> > RCU here because ->cpus_allowed modifications are not RCUified.
>
> Is it so?  That was my question.  Afaics, set_cpus_allowed() does
> p->cpus_allowed = new_mask under rq->lock, so I don't understand how
> tasklist_lock can help.

you are right, we could (and should) make this depend on rq_lock only -
i.e. just take away the tasklist_lock like your patch does.  It's not
like the user could expect to observe any ordering between PID lookup
and affinity-mask changes.  And my RCU comment is bogus: it's not like
we allocate ->cpus_allowed :-/

	Ingo
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  2:58 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: akpm, linux-kernel, ego, dipankar

Ulrich wrote:
> But all this of course does not solve the issue sysconf() has.  In
> sysconf we cannot use sched_getaffinity since all the system's CPUs must
> be reported.

Good point.  Yes, sysconf(_SC_NPROCESSORS_CONF) really needs to continue
returning the number of CPUs online, to maintain compatibility with the
current implementation, and because with a name like that, just about
anything else would be "surprising".

And OpenMP shouldn't be calling sysconf(_SC_NPROCESSORS_CONF), as it does
not want to know how much hardware is there, but rather how much hardware
it is allowed to use.  Something based on sched_getaffinity() would seem
to be ideal for its purposes.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  3:04 UTC (permalink / raw)
To: Ulrich Drepper; +Cc: akpm, linux-kernel, ego, dipankar

Ulrich wrote:
> For sysconf() we still need better support.  Maybe now somebody will
> step up and say they need faster sysconf as well.

That won't be me ;).

For any kernel compiled with CONFIG_CPUSETS (which includes the major
distros I am aware of), one can just count the bits in the top cpuset's
'cpus' value, which is always the online CPU mask.  Even if your user
level software is making no conscious use of cpusets, this works, for
kernels so configured.

If there is someone who needs both a faster sysconf(_SC_NPROCESSORS_CONF)
and a kernel without CPUSETS configured, then they will have to speak up
for themselves.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
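[Editorial note: an illustrative helper, not from the thread, for the bit-counting Paul describes. The cpuset 'cpus' file holds a cpulist string such as "0-3,8"; where it lives depends on where the cpuset filesystem is mounted (commonly /dev/cpuset/cpus in that era), so only the string parsing is sketched here.]

```c
/* Count the CPUs named by a cpulist string like "0-3,8,10-11",
 * the format used by the cpuset 'cpus' file. */
#include <stdlib.h>

int count_cpulist(const char *s)
{
	int count = 0;

	while (*s) {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			break;		/* malformed input, stop parsing */
		if (*end == '-') {	/* a "lo-hi" range */
			s = end + 1;
			hi = strtol(s, &end, 10);
		}
		count += hi - lo + 1;
		s = (*end == ',') ? end + 1 : end;
	}
	return count;
}
```

So reading the top cpuset's 'cpus' file and feeding it through this gives the online-CPU count Paul mentions, with no /proc/cpuinfo scan.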
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  2:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: drepper, linux-kernel, ego, dipankar, cpw

Andrew wrote:
> Paul, could you please describe what cpusets' policy is in the presence
> of CPU addition and removal?

Currently, if we remove the last CPU in a cpuset, we give that cpuset the
CPUs of its parent cpuset, in order to ensure that every cpuset with
tasks attached actually has some CPUs they can run on.  See the routines
kernel/cpuset.c:guarantee_online_cpus_mems_in_subtree() and
kernel/cpuset.c:guarantee_online_cpus(), and close your eyes to the
recursion ... yeah, that has to get fixed someday ;).

But Cliff Wickman <cpw@sgi.com> (added to CC) figured out that this was
broken, as it could easily violate the cpu_exclusive property of cpusets.
He is working on a patch that will move the tasks in the CPU-deficient
cpuset up to their parent cpuset.

In general, the idea is to ensure that every task has at least one CPU on
which it can run.  If the sysadmin intends to unplug all the CPUs
required by some task, he "should" first move those tasks somewhere else
where they can continue to run.  If he doesn't, then the kernel "makes
do", by either (currently) adding CPUs to the CPU-deficient cpuset, or
(future) moving the CPU-deprived tasks somewhere else.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  2:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: drepper, linux-kernel

Andrew wrote:
> I'd have thought that in general an application should be querying its
> present affinity mask - something like sched_getaffinity()?  That fixes
> the CPU hotplug issues too, of course.

The sched_getaffinity call is quick, too, and it nicely reflects any
cpuset constraints, while still working on kernels which don't have
CPUSETs configured.

There are really at least four "number of CPUs" answers here, and we
should be aware of which we are providing.  There are, in order of
decreasing size:
 1) the size of the kernel's cpumask_t (NR_CPUS),
 2) the maximum number of CPUs that might ever be hotplugged into a
    booted system,
 3) the current number of CPUs online in that system, and
 4) the number of CPUs that the current task is allowed to use.

I would suggest that (4) is what we should typically return.  Certainly
it would seem that the use that Ulrich is concerned with, by OpenMP,
wants (4).

Currently, sysconf(_SC_NPROCESSORS_CONF) returns (3), by counting the
CPUs in /proc/stat, which is rather bogus on cpuset- or even
sched_setaffinity-constrained systems.

> But we discussed this all a couple years back and it was decided that
> sched_getaffinity() was unsuitable.  I remember at the time not really
> understanding why?

Perhaps it was because a robust invocation of sched_getaffinity takes a
page of code to write, as Andi Kleen noticed in his libnuma coding of
"static int number_of_cpus(void)".  One has to size the mask passed in,
in case the kernel was compiled with a larger cpumask_t size than you
guessed up front.

In other words, to get (4) using sched_getaffinity, one first needs an
upper bound on (1), above, the kernel's configured NR_CPUS.  One can
either size it by repeatedly invoking sched_getaffinity with larger
masks until it stops failing EINVAL, or one can examine the length of
the "Cpus_allowed" mask displayed in /proc/self/status (it takes 9 ascii
chars, if you include the commas and trailing newline, to display each
32 bits of cpumask_t).  There may be other ways as well; those seem to
be the most common.

At least the kernel cpumask_t size can be cached for the life of the
process and any descendents, so the cost of obtaining it should be less
critical.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
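[Editorial note: a sketch, not from the thread, of the "grow the mask until it stops failing EINVAL" probing Paul describes. It uses glibc's dynamically sized CPU-set macros (CPU_ALLOC and friends), which are a later glibc addition than the code discussed here; the starting size and upper bound are arbitrary choices.]

```c
/* Robustly count the CPUs the calling task may use, without knowing
 * the kernel's NR_CPUS up front: retry with a doubled mask whenever
 * the kernel rejects our buffer as too small. */
#define _GNU_SOURCE
#include <sched.h>

int robust_usable_cpus(void)
{
	/* start at a plausible size and double on failure */
	for (int ncpus = 1024; ncpus <= (1 << 20); ncpus *= 2) {
		cpu_set_t *set = CPU_ALLOC(ncpus);
		size_t size = CPU_ALLOC_SIZE(ncpus);

		if (!set)
			return -1;
		if (sched_getaffinity(0, size, set) == 0) {
			int n = CPU_COUNT_S(size, set);
			CPU_FREE(set);
			return n;	/* answer (4) from the list above */
		}
		CPU_FREE(set);		/* too small (EINVAL); grow and retry */
	}
	return -1;
}
```

The result (and the mask size that finally worked) can be cached for the life of the process, as Paul notes.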
* Re: getting processor numbers
From: Jakub Jelinek @ 2007-04-04  6:47 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andrew Morton, drepper, linux-kernel

On Tue, Apr 03, 2007 at 07:04:58PM -0700, Paul Jackson wrote:
> Andrew wrote:
> > I'd have thought that in general an application should be querying its
> > present affinity mask - something like sched_getaffinity()?  That
> > fixes the CPU hotplug issues too, of course.
>
> The sched_getaffinity call is quick, too, and it nicely reflects any
> cpuset constraints, while still working on kernels which don't have
> CPUSETs configured.
>
> There are really at least four "number of CPUs" answers here, and we
> should be aware of which we are providing.  There are, in order of
> decreasing size:
>  1) the size of the kernel's cpumask_t (NR_CPUS),
>  2) the maximum number of CPUs that might ever be hotplugged into a
>     booted system,
>  3) the current number of CPUs online in that system, and
>  4) the number of CPUs that the current task is allowed to use.
>
> I would suggest that (4) is what we should typically return.  Certainly
> it would seem that the use that Ulrich is concerned with, by OpenMP,
> wants (4).
>
> Currently, sysconf(_SC_NPROCESSORS_CONF) returns (3), by counting the
> CPUs in /proc/stat, which is rather bogus on cpuset, or even
> sched_setaffinity, constrained systems.

OpenMP wants (4) and I'll change it that way.

sysconf(_SC_NPROCESSORS_ONLN) must return (3) (this currently scans
/proc/stat) and sysconf(_SC_NPROCESSORS_CONF) should IMHO return (2)
(this currently scans /proc/cpuinfo on alpha and sparc{,64} for
((ncpus|CPUs) probed|cpus detected) and for the rest just returns
sysconf(_SC_NPROCESSORS_ONLN)).

Neither of the sysconf returned values should be affected by affinity.

	Jakub
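[Editorial note: a quick check, not from the thread, of the two sysconf values Jakub distinguishes. Under his proposed semantics, _CONF (configured CPUs) can only exceed or equal _ONLN (currently online), and neither reflects affinity.]

```c
/* The two CPU counts glibc's sysconf() exposes, per the discussion:
 * _SC_NPROCESSORS_CONF -> answer (2), _SC_NPROCESSORS_ONLN -> (3). */
#include <unistd.h>

long conf_cpus(void)
{
	return sysconf(_SC_NPROCESSORS_CONF);
}

long online_cpus(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}
```

On a system with no CPUs hot-unplugged the two values coincide; after an unplug only _ONLN should drop.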
* Re: getting processor numbers
From: Paul Jackson @ 2007-04-04  7:02 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: akpm, drepper, linux-kernel

Jakub wrote:
> OpenMP wants (4) and I'll change it that way.

Good.  Are you referring to a change in glibc or in OpenMP?

> sysconf(_SC_NPROCESSORS_ONLN) must return (3) (this currently scans
> /proc/stat)

Ok

> sysconf(_SC_NPROCESSORS_CONF) should IMHO return (2) (this currently
> scans /proc/cpuinfo on alpha and sparc{,64} for ((ncpus|CPUs)
> probed|cpus detected) and for the rest just returns
> sysconf(_SC_NPROCESSORS_ONLN)).

Not quite what I would have guessed, but seems ok.  Thanks for spelling
this out.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: getting processor numbers
From: Cliff Wickman @ 2007-04-04 14:51 UTC (permalink / raw)
To: Jakub Jelinek; +Cc: linux-kernel, pj

On Wed, Apr 04, 2007 at 02:47:32AM -0400, Jakub Jelinek wrote:
> On Tue, Apr 03, 2007 at 07:04:58PM -0700, Paul Jackson wrote:
> > There are really at least four "number of CPUs" answers here, and we
> > should be aware of which we are providing.  There are, in order of
> > decreasing size:
> >  1) the size of the kernel's cpumask_t (NR_CPUS),
> >  2) the maximum number of CPUs that might ever be hotplugged into a
> >     booted system,
> >  3) the current number of CPUs online in that system, and
> >  4) the number of CPUs that the current task is allowed to use.
>
> sysconf(_SC_NPROCESSORS_CONF) should IMHO return (2) (this currently
> scans /proc/cpuinfo on alpha and sparc{,64} for ((ncpus|CPUs)
> probed|cpus detected) and for the rest just returns
> sysconf(_SC_NPROCESSORS_ONLN)).
> Neither of the sysconf returned values should be affected by affinity.

I'm looking at an ia64 system, and when a cpu is hot-unplugged it is
removed from /proc/cpuinfo.  Wouldn't /sys/devices/system/cpu/ be a
better source for (2)?

--
Cliff Wickman
Silicon Graphics, Inc.
cpw@sgi.com  (651) 683-3824
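[Editorial note: a sketch, not from the thread, of the sysfs-based counting Cliff suggests, the same readdir approach Ulrich benchmarked in the opening message. It counts entries of the form "cpuN" under /sys/devices/system/cpu; the digit check skips non-CPU entries such as "cpufreq" or "online".]

```c
/* Count CPUs by enumerating cpuN directories in sysfs. */
#include <ctype.h>
#include <dirent.h>
#include <string.h>

int sysfs_cpu_count(void)
{
	DIR *dir = opendir("/sys/devices/system/cpu");
	struct dirent *d;
	int count = 0;

	if (!dir)
		return -1;	/* sysfs not mounted */
	while ((d = readdir(dir)) != NULL) {
		/* match "cpu" followed by a digit: cpu0, cpu1, ... */
		if (strncmp(d->d_name, "cpu", 3) == 0 &&
		    isdigit((unsigned char)d->d_name[3]))
			count++;
	}
	closedir(dir);
	return count;
}
```

As the opening message's timings show, this readdir walk is faster than parsing /proc/cpuinfo but still slower than a single stat() of the directory, which is what motivated the st_nlink proposal.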
Thread overview: 56+ messages
2007-04-03 16:54 getting processor numbers Ulrich Drepper
2007-04-03 17:30 ` linux-os (Dick Johnson)
2007-04-03 17:37 ` Ulrich Drepper
2007-04-03 17:56 ` Dr. David Alan Gilbert
2007-04-03 18:11 ` Andi Kleen
2007-04-03 17:17 ` Ulrich Drepper
2007-04-03 17:22 ` Alan Cox
2007-04-03 17:30 ` Andi Kleen
2007-04-03 20:24 ` Jeremy Fitzhardinge
2007-04-03 17:27 ` Andi Kleen
2007-04-03 17:30 ` Ulrich Drepper
2007-04-03 17:35 ` Andi Kleen
2007-04-03 17:45 ` Ulrich Drepper
2007-04-03 17:58 ` Andi Kleen
2007-04-03 18:05 ` Ulrich Drepper
2007-04-03 18:11 ` Andi Kleen
2007-04-03 18:21 ` Ulrich Drepper
2007-04-03 17:44 ` Siddha, Suresh B
2007-04-03 17:59 ` Ulrich Drepper
2007-04-03 19:40 ` Jakub Jelinek
2007-04-03 20:13 ` Ingo Oeser
2007-04-03 23:38 ` J.A. Magallón
2007-04-03 19:55 ` Ulrich Drepper
2007-04-03 20:13 ` Siddha, Suresh B
2007-04-03 20:19 ` Ulrich Drepper
2007-04-03 20:32 ` Eric Dumazet
2007-04-03 20:20 ` Nathan Lynch
2007-04-03 19:15 ` Davide Libenzi
2007-04-03 19:32 ` Ulrich Drepper
2007-04-04 0:31 ` H. Peter Anvin
2007-04-04 0:35 ` Jeremy Fitzhardinge
2007-04-04 0:38 ` H. Peter Anvin
2007-04-04 5:09 ` Eric Dumazet
2007-04-04 5:16 ` H. Peter Anvin
2007-04-04 5:22 ` Jeremy Fitzhardinge
2007-04-04 5:40 ` H. Peter Anvin
2007-04-04 5:46 ` Eric Dumazet
2007-04-04 5:29 ` Eric Dumazet
2007-04-03 20:16 ` Andrew Morton
[not found] ` <4612BB89.8040102@redhat.com>
[not found] ` <20070403141348.9bcdb13e.akpm@linux-foundation.org>
2007-04-03 22:13 ` Ulrich Drepper
2007-04-03 22:48 ` Andrew Morton
2007-04-03 23:00 ` Ulrich Drepper
2007-04-03 23:23 ` Andrew Morton
2007-04-03 23:54 ` Ulrich Drepper
2007-04-04 2:55 ` Paul Jackson
2007-04-04 8:39 ` Oleg Nesterov
2007-04-04 9:39 ` Ingo Molnar
2007-04-04 8:57 ` Oleg Nesterov
2007-04-04 10:01 ` Ingo Molnar
2007-04-04 2:58 ` Paul Jackson
2007-04-04 3:04 ` Paul Jackson
2007-04-04 2:52 ` Paul Jackson
2007-04-04 2:04 ` Paul Jackson
2007-04-04 6:47 ` Jakub Jelinek
2007-04-04 7:02 ` Paul Jackson
2007-04-04 14:51 ` Cliff Wickman