From mboxrd@z Thu Jan  1 00:00:00 1970
From: jmayo@nvidia.com (Jon Mayo)
Date: Wed, 22 Jun 2011 12:26:11 -0700
Subject: [PATCH] ARM: report present cpus in /proc/cpuinfo
In-Reply-To: <20110622093623.GP23234@n2100.arm.linux.org.uk>
References: <4E012198.6010405@nvidia.com>
	<20110621230512.GL23234@n2100.arm.linux.org.uk>
	<4E012820.2090208@nvidia.com>
	<20110621233619.GM23234@n2100.arm.linux.org.uk>
	<4E01326C.1060808@nvidia.com>
	<20110622093623.GP23234@n2100.arm.linux.org.uk>
Message-ID: <4E0241D3.4060303@nvidia.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On 06/22/2011 02:36 AM, Russell King - ARM Linux wrote:
> On Tue, Jun 21, 2011 at 05:08:12PM -0700, Jon Mayo wrote:
>> This issue has had me concerned for a while. Because in userspace it can
>> be advantageous to allocate per-cpu structures on start-up for some
>> threading tricks. But if you use the wrong count, funny things can
>> happen.
>
> Again, if you look at the glibc sources, you'll find that they (in
> theory) provide two different calls - sysconf(_SC_NPROCESSORS_CONF)
> and sysconf(_SC_NPROCESSORS_ONLN).  See sysdeps/posix/sysconf.c.
>
> That suggests you should be using sysconf(_SC_NPROCESSORS_CONF).
>
> However, these are provided by __get_nprocs_conf() and __get_nprocs()
> respectively. See sysdeps/unix/sysv/linux/getsysstats.c.  Notice this
> comment:
>
> /* As far as I know Linux has no separate numbers for configured and
>     available processors.  So make the `get_nprocs_conf' function an
>     alias.  */
> strong_alias (__get_nprocs, __get_nprocs_conf)
>
> So, sysconf(_SC_NPROCESSORS_CONF) and sysconf(_SC_NPROCESSORS_ONLN)
> will probably return the same thing - and it seems to me that you
> require that to be fixed.  That's not for us in ARM to sort out -
> that's a _generic_ kernel and glibc issue, and needs to be discussed
> elsewhere.
>

Thanks for that. I don't look at glibc too much. I tend to run 
everything but glibc.
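
For anyone following along, the two queries can be compared from C with
sysconf() (getconf is the command-line wrapper for the same values). This
is only a minimal sketch: struct percpu_slot and the helper names are made
up for illustration, and whether the two numbers ever differ depends on
the libc in use, given the alias quoted above.

```c
#include <stdlib.h>
#include <unistd.h>

/* CPUs the system is configured with; intended to include offline ones. */
long cpus_configured(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}

/* CPUs currently online. */
long cpus_online(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Hypothetical per-cpu structure, only here to show the sizing decision. */
struct percpu_slot {
    unsigned long counter;
};

/*
 * Allocate one slot per configured CPU rather than per online CPU, so a
 * CPU that comes online later still has a slot. Falls back to one slot
 * if the query fails. Caller frees the result.
 */
struct percpu_slot *alloc_percpu_slots(long *count_out)
{
    long n = cpus_configured();

    if (n < 1)
        n = 1;
    *count_out = n;
    return calloc((size_t)n, sizeof(struct percpu_slot));
}
```

Sizing by the online count instead is where the "funny things" come from:
a thread that later lands on a CPU numbered past the end of the array has
no slot.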

>>>> I don't think the behavior of ARM linux makes sense. Neither change is
>>>> truly correct in my mind. What I feel is the correct behavior is a list
>>>> (in both stat and cpuinfo) of all cpus either running a task or ready to
>>>> run a task.
>>>
>>> That _is_ what you have listed in /proc/cpuinfo and /proc/stat.
>>>
>>
>> What I see is my idle cpus are not there because we hot unplug them so
>> their power domains can be turned off. scheduling them can happen, but
>> only if an extra step occurs. From user space it's transparent, from
>> kernel space, there is a whole framework making decisions about when to
>> dynamically turn on what.
>
> Exactly.  You're complaining that the kernel's interpretation of the masks
> is not correct because you're driving it with a user program which
> effectively changes that behaviour.
>

Small correction to your statement: I'm driving it entirely from the
kernel, and presenting something that isn't quite what user programs expect.

> So, if we change the interpretation of the masks, we'll then have people
> who aren't using your user program complaining that the masks are wrong
> for them.  It's a no-win situation - there is no overall benefit to
> changing the kernel.
>
> The fact that you're using a user program which dynamically hot-plugs
> CPUs means that _you're_ changing the system behaviour by running that
> program, and _you're_ changing the meaning of those masks.
>

Yeah, I'm not doing that. It is code in mach-tegra that dynamically
hotplugs CPUs.

>>>> cpu_possible_mask, cpu_present_mask, and cpu_online_mask
>>>> don't have semantics on ARM that I feel is right. (I don't understand
>>>> what cpu_active_mask is, but it's probably not what I want either)
>>>
>>> They have their defined meaning.
>>>
>>> cpu_possible_mask - the CPU number may be available
>>> cpu_present_mask - the CPU number is present and is available to be brought
>>> 	online upon request by the hotplug code
>>> cpu_online_mask - the CPU is becoming available for scheduling
>>> cpu_active_mask - the CPU is fully online and available for scheduling
>>>
>>> CPUs only spend a _very_ short time in the online but !active state
>>> (about the time it takes the CPU asking for it to be brought up to
>>> notice that it has been brought up, and for the scheduler migration
>>> code to receive the notification that the CPU is now online.)  So
>>> you can regard the active mask as a mere copy of the online mask for
>>> most purposes.
>>>
>>> CPUs may be set in the possible mask but not the present mask - that
>>> can happen if you limit the number of CPUs on the kernel command line.
>>> However, we have no way to bring those CPUs to "present" status, and
>>> so they are not available for bringing online - as far as the software
>>> is concerned, they're as good as being physically unplugged.
>>>
>>
>> I don't see a use for that semantic. Why shouldn't we add a couple lines
>> of code to the kernel to scrub out unusable situations?
>
> Think about it - if you have real hot-pluggable CPUs (servers do), do
> you _really_ want to try to bring online a possible CPU (iow, there's
> a socket on the board) but one which isn't present (iow, the socket is
> empty.)
>
> That's what the possible + !present case caters for.  Possible tells
> the kernel how many CPUs to allocate per-cpu data structures for.
> present tells it whether a CPU can be onlined or not.
>

Yes, that's the difference between present and possible. I'm not 
suggesting we report cpus that do not exist. I'm suggesting we report 
cpus that are present, online or not.
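
From user space, the possible/present/online distinction is visible under
/sys/devices/system/cpu/, where each mask is exported as a range list such
as "0-3" or "0,2-3". A rough sketch of counting the CPUs in one of those
files follows; the function names are mine, not an existing API, and the
parser only handles the plain list format.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a kernel cpulist string such as "0-3" or
 * "0,2-3" (a trailing newline is allowed). Returns -1 if the string is
 * malformed, 0 for an empty list.
 */
int cpulist_count(const char *s)
{
    int total = 0;

    while (*s && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi;

        if (end == s || lo < 0)
            return -1;
        hi = lo;
        if (*end == '-') {
            const char *p = end + 1;

            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        total += (int)(hi - lo + 1);
        if (*end == ',')
            end++;
        else if (*end != '\0' && *end != '\n')
            return -1;
        s = end;
    }
    return total;
}

/*
 * Read /sys/devices/system/cpu/<name>, where name is "possible",
 * "present" or "online", and count the CPUs listed. Returns -1 if the
 * file cannot be read.
 */
int count_cpus_in_mask_file(const char *name)
{
    char path[128];
    char buf[4096];
    FILE *f;

    snprintf(path, sizeof path, "/sys/devices/system/cpu/%s", name);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cpulist_count(buf);
}
```

On a system that hot-unplugs idle cores, count_cpus_in_mask_file("present")
is the number I would want reported, while count_cpus_in_mask_file("online")
changes as cores come and go.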

>>> So, we (in arch/arm) can't change that decision.  Same for online &&
>>> active must both be set in order for any process to be scheduled onto
>>> that CPU - if any process is on a CPU which is going offline (and
>>> therefore !active, !online) then it will be migrated off that CPU by
>>> generic code before the CPU goes offline.
>>>
>>
>> I will accept that. But then does that mean we (either arch/arm or
>> mach-tegra) have used the cpu hotplug system incorrectly?
>
> It means you're using it in ways that it was not originally designed
> to be used - which is for the physical act of hot-plugging CPUs in
> servers.  CPU hotplug was never designed from the outset for this
> kind of dynamic CPU power management.
>
> Yes, you _can_ use the CPU hotplug interfaces to do this, but as you're
> finding, there are problems with doing this.
>
> We can't go around making ARM use CPU hotplug differently from everyone
> else because that'll make things extremely fragile.  As you've already
> found out, glibc sysconf() ultimately uses /proc/stat to return the
> number of CPUs.  So in order to allow dynamic hotplugging _and_ return
> the sensible 'online CPUs' where 'online' means both those which are
> currently running and dormant, you need to change generic code.
>

I agree 100%.

> Plus, there's the issue of CPU affinity for processes and IRQs.  With
> current CPU hotplug, a process which has chosen to bind to a particular
> CPU will have that binding destroyed when the CPU is hot unplugged, and
> its affinity will be broken.  It will be run on a different CPU.  This
> is probably not the semantics you desire.
>

Or maybe it is. I haven't decided yet. For power reasons I might want to 
ignore the affinity until demand goes up for more cores.
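
To make the affinity point concrete, here is a small sketch of how a
process binds itself to one CPU and later checks whether that binding
still holds. The helper names are invented for this example; it uses the
Linux-specific sched_getaffinity()/sched_setaffinity() calls and assumes
glibc's CPU_* macros (hence _GNU_SOURCE).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

/*
 * Bind the calling thread to the lowest-numbered CPU it is currently
 * allowed to run on. Returns that CPU number, or -1 on failure.
 */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    cpu_set_t one;
    int cpu;

    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof one, &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}

/*
 * Returns 1 if the calling thread is still restricted to exactly `cpu`,
 * 0 if its affinity mask has changed, -1 on error.
 */
int still_pinned_to(int cpu)
{
    cpu_set_t now;

    if (sched_getaffinity(0, sizeof now, &now) != 0)
        return -1;
    return CPU_COUNT(&now) == 1 && CPU_ISSET(cpu, &now);
}
```

If the behaviour is as described above, a task pinned this way to a core
that then gets hot-unplugged would see still_pinned_to() return 0
afterwards, because the kernel had to move it elsewhere.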

> I'd argue that trying to do dynamic hotplugging is the wrong approach,
> especially as there is CPUidle (see below.)
>
>>> I think what you're getting confused over is that within nvidia, you're
>>> probably dynamically hotplugging CPUs, and so offline CPUs are apparently
>>> available to the system if the load on the system rises.  That's not
>>> something in the generic kernel, and is a custom addition.  Such an
>>> addition _can_ be seen to change the definition of the above masks,
>>> but that's not the fault of the kernel - that's the way you're driving
>>> the hotplug system.
>>>
>>
>> Sorry. I thought we weren't the only ones in ARM driving it this way. If
>> what we've done is strange, I'd like to correct it.
>
> I'm aware of ARM having done something like this in their SMP group
> in the early days of SMP support, but I wasn't aware that it was still
> being actively pursued.  I guess that it never really got out of the
> prototyping stage (or maybe it did but they chose not to address these
> issues.)
>
>> Like if I were to think of a big mainframe or xeon server with hotplug
>> cpus, the way the masks work makes perfect sense. I push a button, all
>> the processes get cleared from the cpu, it is marked as offline. I pull
>> the card from the cabinet, and then it is !present. And maybe insert a
>> new card at a later date. It's just like any other sort of hotplug thing.
>>
>> I think my issue with cpuinfo/stat's output is with the semantics for
>> "online" being different for this one architecture (mach-tegra) and
>> possibly others (??) than what I would expect.
>
> No, it's no different.  As I've explained above, the difference is that
> you're running a userspace program which automatically does the
> hotplugging depending on the system load.
>

No, I'm not. There is no user space program involved at all.

> That _can_ be viewed as fundamentally changing the system behaviour
> because CPUs which are offlined are still available for scheduling
> should the system load become high enough to trigger it.
>
> I think there's an easier way to solve this problem: there is the CPU
> idle infrastructure, which allows idle CPUs to remain online while
> allowing them to power down when they're not required.  Because they
> remain online, the scheduler will migrate tasks to them if they're
> not doing anything, and maybe that's something else to look at.
>
> The other advantage of CPUidle is that you're not breaking the affinity
> of anything when the CPU is powered down, unlike hotplug.
>
> So, I think you really should be looking at CPUidle, rather than trying
> to do dynamic hotplugging based on system load.

The disadvantage of CPUidle is that there is no way, that I can see, to
handle asymmetric power domains. I don't ever want to turn off cpu0,
because its power domain is coupled to a bunch of other things. But any
additional cores (1 or more) are on a different power domain (shared
between all additional cores).

If cpu0 is idle, I will want to kick everyone off cpu1 and push them
onto cpu0, then shut cpu1 off. It gets worse with more cores (a bunch of
companies have already announced 4-core ARMs, for example).