linux-arm-kernel.lists.infradead.org archive mirror
* RFC: Dynamic hwcaps
@ 2010-12-03 16:28 Dave Martin
  2010-12-03 16:43 ` Jesse Barker
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Dave Martin @ 2010-12-03 16:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

I'd be interested in people's views on the following idea-- feel free
to ignore if it doesn't interest you.


For power-management purposes, it's useful to be able to turn off
functional blocks on the SoC.

For on-SoC peripherals, this can be managed through the driver
framework in the kernel, but for functional blocks of the CPU itself
which are used by instruction set extensions, such as NEON or other
media accelerators, it would be interesting if processes could adapt
to these units appearing and disappearing at runtime.  This would mean
that user processes would need to select dynamically between different
implementations of accelerated functionality at runtime.

This allows for more active power management of such functional
blocks: if the CPU is not fully loaded, you can turn them off -- the
kernel can spot when there is significant idle time and do this.  If
the CPU becomes fully loaded, applications which have soft-realtime
constraints can notice this and switch to their accelerated code
(which will cause the kernel to switch the functional unit(s) on).
Or, the kernel can react to increasing CPU load by speculatively turning
it on instead.  This is analogous to the behaviour of other power
governors in the system.  Non-aware applications will still work
seamlessly -- these may simply run accelerated code if the hardware
supports it, causing the kernel to turn the affected functional
block(s) on.

In order for this to work, some dynamic status information would need
to be visible to each user process, and polled each time a function
with a dynamically switchable choice of implementations gets called.
You probably don't need to worry about race conditions either-- if the
process accidentally tries to use a turned-off feature, you will take
a fault which gives the kernel the chance to turn the feature back on.
Generally, this should be a rare occurrence.
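
(To make that concrete, here's a minimal sketch of such a dispatch
point, assuming some mechanism -- like the ones discussed below -- has
mapped a kernel-updated status word into the process; the flag name and
helpers are invented for illustration:)

#include <stdint.h>

/* Hypothetical kernel-updated status word; see the mmap sketch below
 * for one way it might get set up. */
extern volatile uint32_t *dyn_hwcaps;

#define DYN_HWCAP_NEON (1u << 0)   /* invented: NEON block is "ready" */

/* Stand-ins for the two implementations; bodies not shown. */
extern void copy_pixmap_neon(void *dst, const void *src, unsigned long len);
extern void copy_pixmap_generic(void *dst, const void *src, unsigned long len);

void copy_pixmap(void *dst, const void *src, unsigned long len)
{
    /* Poll the dynamic status on every call.  It's only a hint: a
     * stale value just means we take the less ideal path once, or
     * fault and have the kernel re-enable the unit. */
    if (*dyn_hwcaps & DYN_HWCAP_NEON)
        copy_pixmap_neon(dst, src, len);
    else
        copy_pixmap_generic(dst, src, len);
}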


The dynamic feature status information should ideally be per-CPU
global, though we could have a separate copy per thread, at the cost
of more memory.  It can't be system-global, since different CPUs may
have a different set of functional blocks active at any one time --
for this reason, the information can't be stored in an existing
mapping such as the vectors page.  Conversely, existing mechanisms
such as sysfs probably involve too much overhead to be polled every time
you call copy_pixmap() or whatever.

Alternatively, each thread could register a userspace buffer (a single
word is probably adequate) into which the CPU pokes the hardware
status flags each time it returns to userspace, if the hardware status
has changed or if the thread has been migrated.

Either of the above approaches could be prototyped as an mmap'able
driver, though this may not be the best approach in the long run.
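
(For illustration, the userspace side of such a prototype might look
like this -- the device name and single-word layout are entirely made
up:)

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

volatile uint32_t *dyn_hwcaps;

int dyn_hwcaps_init(void)
{
    int fd = open("/dev/dynhwcap", O_RDONLY);   /* imaginary driver */

    if (fd < 0)
        return -1;
    dyn_hwcaps = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (dyn_hwcaps == MAP_FAILED) {
        dyn_hwcaps = NULL;
        return -1;
    }
    return 0;
}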


Does anyone have a view on whether this is a worthwhile idea, or what
the best approach would be?

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 16:28 RFC: Dynamic hwcaps Dave Martin
@ 2010-12-03 16:43 ` Jesse Barker
  2010-12-03 16:51 ` Russell King - ARM Linux
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Jesse Barker @ 2010-12-03 16:43 UTC (permalink / raw)
  To: linux-arm-kernel

Dave,

For the case of NEON and its use in graphics libraries, we are certainly
pushing explicitly for runtime detection.  However, this tends to be done by
detecting the presence of NEON at initialization time, rather than at each
path invocation (to avoid rescanning /proc/self/auxv).  Are you saying that
the init code could still detect NEON this way, but there would need to be
additional checks when taking individual paths?

cheers,
Jesse

On Fri, Dec 3, 2010 at 8:28 AM, Dave Martin <dave.martin@linaro.org> wrote:

> Hi all,
>
> I'd be interested in people's views on the following idea-- feel free
> to ignore if it doesn't interest you.
>
>
> For power-management purposes, it's useful to be able to turn off
> functional blocks on the SoC.
>
> For on-SoC peripherals, this can be managed through the driver
> framework in the kernel, but for functional blocks of the CPU itself
> which are used by instruction set extensions, such as NEON or other
> media accelerators, it would be interesting if processes could adapt
> to these units appearing and disappearing at runtime.  This would mean
> that user processes would need to select dynamically between different
> implementations of accelerated functionality at runtime.
>
> This allows for more active power management of such functional
> blocks: if the CPU is not fully loaded, you can turn them off -- the
> kernel can spot when there is significant idle time and do this.  If
> the CPU becomes fully loaded, applications which have soft-realtime
> constraints can notice this and switch to their accelerated code
> (which will cause the kernel to switch the functional unit(s) on).
> Or, the kernel can react to increasing CPU load by speculatively turning
> it on instead.  This is analogous to the behaviour of other power
> governors in the system.  Non-aware applications will still work
> seamlessly -- these may simply run accelerated code if the hardware
> supports it, causing the kernel to turn the affected functional
> block(s) on.
>
> In order for this to work, some dynamic status information would need
> to be visible to each user process, and polled each time a function
> with a dynamically switchable choice of implementations gets called.
> You probably don't need to worry about race conditions either-- if the
> process accidentally tries to use a turned-off feature, you will take
> a fault which gives the kernel the chance to turn the feature back on.
>  Generally, this should be a rare occurrence.
>
>
> The dynamic feature status information should ideally be per-CPU
> global, though we could have a separate copy per thread, at the cost
> of more memory.  It can't be system-global, since different CPUs may
> have a different set of functional blocks active at any one time --
> for this reason, the information can't be stored in an existing
> mapping such as the vectors page.  Conversely, existing mechanisms
> such as sysfs probably involve too much overhead to be polled every time
> you call copy_pixmap() or whatever.
>
> Alternatively, each thread could register a userspace buffer (a single
> word is probably adequate) into which the CPU pokes the hardware
> status flags each time it returns to userspace, if the hardware status
> has changed or if the thread has been migrated.
>
> Either of the above approaches could be prototyped as an mmap'able
> driver, though this may not be the best approach in the long run.
>
>
> Does anyone have a view on whether this is a worthwhile idea, or what
> the best approach would be?
>
> Cheers
> ---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 16:28 RFC: Dynamic hwcaps Dave Martin
  2010-12-03 16:43 ` Jesse Barker
@ 2010-12-03 16:51 ` Russell King - ARM Linux
  2010-12-03 17:35   ` Dave Martin
  2010-12-05 14:12 ` Thomas Petazzoni
  2010-12-07 21:15 ` Ben Dooks
  3 siblings, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2010-12-03 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 03, 2010 at 04:28:27PM +0000, Dave Martin wrote:
> For on-SoC peripherals, this can be managed through the driver
> framework in the kernel, but for functional blocks of the CPU itself
> which are used by instruction set extensions, such as NEON or other
> media accelerators, it would be interesting if processes could adapt
> to these units appearing and disappearing at runtime.  This would mean
> that user processes would need to select dynamically between different
> implementations of accelerated functionality at runtime.

The ELF hwcaps are used by the linker to determine what facilities
are available, and therefore which dynamic libraries to link in.

For instance, if you have a selection of C libraries on your platform
built for different features - eg, let's say you have a VFP based
library and a soft-VFP based library.

If the linker sees - at application startup - that HWCAP_VFP is set,
it will select the VFP based library.  If HWCAP_VFP is not set, it
will select the soft-VFP based library instead.

A VFP-based library is likely to contain VFP instructions, sometimes
in the most unlikely of places - eg, printf/scanf is likely to invoke
VFP instructions even when they aren't dealing with floating point in
their format string.

The problem comes if you take away HWCAP_VFP after an application
has been bound to the hard-VFP library: there is no way, short of
killing and re-exec'ing the program, to change the libraries that it
is bound to.

> In order for this to work, some dynamic status information would need
> to be visible to each user process, and polled each time a function
> with a dynamically switchable choice of implementations gets called.
> You probably don't need to worry about race conditions either-- if the
> process accidentally tries to use a turned-off feature, you will take
> a fault which gives the kernel the chance to turn the feature back on.

Yes, you can use a fault to re-enable some features such as VFP.

> The dynamic feature status information should ideally be per-CPU
> global, though we could have a separate copy per thread, at the cost
> of more memory.

Threads are migrated across CPUs so you can't rely on saying CPU0 has
VFP powered up and CPU1 has VFP powered down, and then expect that
threads using VFP will remain on CPU0.  The system will spontaneously
move that thread to CPU1 if CPU1 is less loaded than CPU0.

I think what may be possible is to hook VFP power state into the code
which enables/disables access to VFP.

However, I'm not aware of any platforms or CPUs where (eg) VFP is
powered or clocked independently of the main CPU.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 16:51 ` Russell King - ARM Linux
@ 2010-12-03 17:35   ` Dave Martin
  2010-12-05 15:14     ` Mark Mitchell
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Martin @ 2010-12-03 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Dec 3, 2010 at 4:51 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Dec 03, 2010 at 04:28:27PM +0000, Dave Martin wrote:
>> For on-SoC peripherals, this can be managed through the driver
>> framework in the kernel, but for functional blocks of the CPU itself
>> which are used by instruction set extensions, such as NEON or other
>> media accelerators, it would be interesting if processes could adapt
>> to these units appearing and disappearing at runtime.  This would mean
>> that user processes would need to select dynamically between different
>> implementations of accelerated functionality at runtime.
>
> The ELF hwcaps are used by the linker to determine what facilities
> are available, and therefore which dynamic libraries to link in.
>
> For instance, if you have a selection of C libraries on your platform
> built for different features - eg, lets say you have a VFP based
> library and a soft-VFP based library.
>
> If the linker sees - at application startup - that HWCAP_VFP is set,
>> it will select the VFP based library.  If HWCAP_VFP is not set, it
> will select the soft-VFP based library instead.
>
> A VFP-based library is likely to contain VFP instructions, sometimes
> in the most unlikely of places - eg, printf/scanf is likely to invoke
> VFP instructions even when they aren't dealing with floating point in
> their format string.

True... this is most likely to be useful for specialised functional
units which are used in specific places (such as NEON), and which
aren't distributed throughout the code.  As you say, in
general-purpose code built with -mfpu=vfp*, VFP is distributed all
over the place, so you'd probably see a net cost as you thrash turning
VFP on and off.  The point may be moot-- I'm not aware of a SoC which
can power-manage VFP; but NEON might be different.

What you describe is one of two mechanisms currently in use--- the
other is for a single library to contain two implementations of
certain functions and to choose between them based on the hwcaps.
Typically, one set of functions is chosen at library initialisation
time.  Some libraries, such as libpixman, are implemented this way;
and it's often preferable, since the proportion of functions in a
library which get significant benefit from special instruction set
extensions is often pretty small.  So you avoid having duplicate
copies of libraries in the filesystem.  (Of course, if the distro's
packager was intelligent enough, it could avoid installing the
duplicate, but that's a separate issue.)

Unfortunately, glibc does a good job of hiding not only the hwcaps
passed on the initial stack but also the derived information which
drives shared library selection (or at least frustrates reliable
access to this information); so generally code which wants to check
the hwcaps must read /proc/self/auxv (or parse /proc/cpuinfo ... but
that's more laborious).  However, the cost isn't too problematic if
this only happens once, when a library is initialised.
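
(For reference, a minimal sketch of that auxv check -- HWCAP_NEON's
value is taken from the kernel's asm/hwcap.h, and error handling is
trimmed:)

#include <elf.h>
#include <stdio.h>

#define HWCAP_NEON (1 << 12)   /* from the kernel's asm/hwcap.h */

static unsigned long read_hwcaps(void)
{
    Elf32_auxv_t aux;
    unsigned long hwcaps = 0;
    FILE *f = fopen("/proc/self/auxv", "rb");

    if (!f)
        return 0;
    while (fread(&aux, sizeof(aux), 1, f) == 1 && aux.a_type != AT_NULL)
        if (aux.a_type == AT_HWCAP)
            hwcaps = aux.a_un.a_val;
    fclose(f);
    return hwcaps;
}

int have_neon(void)
{
    return (read_hwcaps() & HWCAP_NEON) != 0;
}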

In the near future, STT_GNU_IFUNC support in the tools and ld.so may add
to the mix, by allowing the dynamic linker to select different
implementations of code at the function level, not just the
whole-library level.  If so, this will provide a better way to
implement the optimised function selection outlined above.
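
(Sketched in code, assuming GCC's new ifunc attribute, and assuming
ld.so passes AT_HWCAP to the resolver as it does on some architectures
-- the function names are invented:)

#include <stddef.h>
#include <string.h>

#define HWCAP_NEON (1 << 12)   /* from the kernel's asm/hwcap.h */

/* Stand-ins: a real NEON version would have a different inner loop. */
static void *copy_area_neon(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

static void *copy_area_plain(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* The dynamic linker runs the resolver once, at relocation time, and
 * binds copy_area to whichever implementation it returns. */
static __typeof__(copy_area_plain) *resolve_copy_area(unsigned long hwcap)
{
    return (hwcap & HWCAP_NEON) ? copy_area_neon : copy_area_plain;
}

void *copy_area(void *d, const void *s, size_t n)
    __attribute__((ifunc("resolve_copy_area")));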

>
> The problem comes is if you take away HWCAP_VFP after an application
> has been bound to the hard-VFP library, there is no way, sort of
> killing and re-exec'ing the program, to change the libraries that it
> is bound to.

Agreed--- the application has to be aware in order for this to become
really useful.

However, to be clear, I'm not suggesting that the kernel should _ever_
break the contract embodied in /proc/cpuinfo, or the hwcaps passed at
process startup.  If the hwcaps say NEON is supported then it must be
supported (though this is allowed to involve a fault and a possible
SoC-specific delay while the functional unit is brought back online).

Rather, the dynamic status would indicate whether or not the
functional unit is in a "ready" state.

>
>> In order for this to work, some dynamic status information would need
>> to be visible to each user process, and polled each time a function
>> with a dynamically switchable choice of implementations gets called.
>> You probably don't need to worry about race conditions either-- if the
>> process accidentally tries to use a turned-off feature, you will take
>> a fault which gives the kernel the chance to turn the feature back on.
>
> Yes, you can use a fault to re-enable some features such as VFP.
>
>> The dynamic feature status information should ideally be per-CPU
>> global, though we could have a separate copy per thread, at the cost
>> of more memory.
>
> Threads are migrated across CPUs so you can't rely on saying CPU0 has
> VFP powered up and CPU1 has VFP powered down, and then expect that
> threads using VFP will remain on CPU0.  The system will spontaneously
> move that thread to CPU1 if CPU1 is less loaded than CPU0.

My theory was that this wouldn't matter -- the dynamic status contains
hints that this or that functional unit is likely to be in a "ready"
state.  It's statistically unlikely that the thread will be suspended or
migrated during a single execution of a particular function in most
cases; though of course it may happen sometimes.

If a thread tries to execute an instruction and finds that
functional unit turned off, the kernel then makes a decision about
whether to sleep the process for a bit, turn the feature on locally,
or migrate the thread.

> I think what may be possible is to hook VFP power state into the code
> which enables/disables access to VFP.

Indeed; I believe that in some implementations the SoC is clever
enough to save some power automatically when these features are
disabled (provided that the saving is non-destructive).

>
> However, I'm not aware of any platforms or CPUs where (eg) VFP is
> powered or clocked independently of the main CPU.
>

As I said above, the main use case I'm aware of would be NEON; it's
possible that other vendors' extensions such as iwmmxt can also be
managed in a similar way, but this is outside my field of knowledge.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 16:28 RFC: Dynamic hwcaps Dave Martin
  2010-12-03 16:43 ` Jesse Barker
  2010-12-03 16:51 ` Russell King - ARM Linux
@ 2010-12-05 14:12 ` Thomas Petazzoni
  2010-12-06 10:56   ` Dave Martin
  2010-12-07 21:15 ` Ben Dooks
  3 siblings, 1 reply; 18+ messages in thread
From: Thomas Petazzoni @ 2010-12-05 14:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, 3 Dec 2010 16:28:27 +0000
Dave Martin <dave.martin@linaro.org> wrote:

> This allows for more active power management of such functional
> blocks: if the CPU is not fully loaded, you can turn them off -- the
> kernel can spot when there is significant idle time and do this.  If
> the CPU becomes fully loaded, applications which have soft-realtime
> constraints can notice this and switch to their accelerated code
> (which will cause the kernel to switch the functional unit(s) on).
> Or, the kernel can react to increasing CPU load by speculatively turn
> it on instead.  This is analogous to the behaviour of other power
> governors in the system.  Non-aware applications will still work
> seamlessly -- these may simply run accelerated code if the hardware
> supports it, causing the kernel to turn the affected functional
> block(s) on.

From a power management perspective, is it really useful to load the
CPU instead of using specialized units which usually provide more
computing power per watt consumed?

When the CPU is idle, it can enter sleep states to save power and let a
more specialized unit do the optimized work. For example, when doing
video decoding, specialized DSPs probably do a much better job from a
power management perspective than the CPU would, so it's better to
keep the CPU idle and let the DSP do its video decoding job. No?

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 17:35   ` Dave Martin
@ 2010-12-05 15:14     ` Mark Mitchell
  2010-12-06 11:07       ` Dave Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Mitchell @ 2010-12-05 15:14 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/3/2010 11:35 AM, Dave Martin wrote:

> What you describe is one of two mechanisms currently in use--- the
> other is for a single library to contain two implementations of
> certain functions and to choose between them based on the hwcaps.
>> Typically, one set of functions is chosen at library initialisation
>> time.  Some libraries, such as libpixman, are implemented this way;
>> and it's often preferable, since the proportion of functions in a
> library which get significant benefit from special instruction set
> extensions is often pretty small.

I've believed for a long time that we should try to encourage this
approach.  The current approach (different libraries for each hardware
configuration) is prevalent, both in the toolchain ("multilibs") and in
other libraries -- but it seems to me premised on the idea that one is
building everything from source for one's particular hardware.  In the
earlier days of FOSS, the typical installation model was to download a
source tarball, build it, and install it on your local machine.  In that
context, tuning the library "just so" for your machine made sense.  But,
to enable binary distribution, having to have N copies of a library (let
alone an application) for N different ARM core variants just doesn't
make sense to me.

So, I certainly think that things like STT_GNU_IFUNC (which enable
determination of which routine to use at application start-up) make a
lot of sense.

I think your idea of exposing whether a unit is "ready", to allow even
more fine-grained choices as an application runs, is clever.  I don't
really know enough to say whether most applications could take advantage
of that.  One of the problems I see is that you need global information,
not local information.  In particular, if I'm using NEON to implement
the inner loop of some performance-critical application, then when the
unit is not ready, I want the kernel to wake it up already!  But, if I'm
just using NEON to do some random computation off the critical path, I'm
probably happy to do it slowly if that's more efficient than waking up
the NEON unit.  But, which of these cases I'm in isn't always locally
known at the point I'm doing the computation; the computation may be
buried in a small library routine.

Do we have good examples of applications that could profit from this
capability?

-- 
Mark Mitchell
CodeSourcery
mark at codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-05 14:12 ` Thomas Petazzoni
@ 2010-12-06 10:56   ` Dave Martin
  2010-12-08 11:01     ` Jamie Lokier
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Martin @ 2010-12-06 10:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 5, 2010 at 2:12 PM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Hi,
>
> On Fri, 3 Dec 2010 16:28:27 +0000
> Dave Martin <dave.martin@linaro.org> wrote:
>
>> This allows for more active power management of such functional
>> blocks: if the CPU is not fully loaded, you can turn them off -- the
>> kernel can spot when there is significant idle time and do this.  If
>> the CPU becomes fully loaded, applications which have soft-realtime
>> constraints can notice this and switch to their accelerated code
>> (which will cause the kernel to switch the functional unit(s) on).
>> Or, the kernel can react to increasing CPU load by speculatively turning
>> it on instead.  This is analogous to the behaviour of other power
>> governors in the system.  Non-aware applications will still work
>> seamlessly -- these may simply run accelerated code if the hardware
>> supports it, causing the kernel to turn the affected functional
>> block(s) on.
>
> From a power management perspective, is it really useful to load the
> CPU instead of using specialized units which usually provide more
> computing power per watt consumed?

No--- but you can't in general just exchange cycles on one functional
unit for cycles on another.

Suppose 90% of your code (by execution time) can take advantage of a
specialised functional unit.  Should you turn that unit on?

Now, suppose only 5% of the code can take advantage, but the platform
is not completely busy.  Turning on a special functional unit consumes
extra power and will provide no speedup to the user -- is it still
worth turning it on?  What if the CPU is fully loaded doing other work
and your program is close to missing its realtime deadlines -- should
you turn on the separate unit now?

It's not an easy thing to judge -- really, I'm just wondering whether
dynamic adaptation is feasible at all and whether it's worth
experimenting with...

> When the CPU is idle, it can enter sleep states to save power and let a
> more specialized unit do the optimized work. For example, when doing
> video decoding, specialized DSPs probably do a much better job from a
> power management perspective than the CPU would, so it's better to
> keep the CPU idle and let the DSP do its video decoding job. No?

Often, definitely yes; however, it depends on various factors -- not
least, the software must have been ported to make use of the DSP in
order for this to be possible at all.

But the performance and power aspects are not trivial: separate DSP
units tend to have high setup and teardown costs, so as above, if the
total load on the DSP will be low, it may not be worth using it at all
from a power perspective; and using a DSP in the wrong way can also
lead to slower execution than doing everything on the CPU.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-05 15:14     ` Mark Mitchell
@ 2010-12-06 11:07       ` Dave Martin
  2010-12-07  1:02         ` Mark Mitchell
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Martin @ 2010-12-06 11:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 5, 2010 at 3:14 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/3/2010 11:35 AM, Dave Martin wrote:
>
>> What you describe is one of two mechanisms currently in use--- the
>> other is for a single library to contain two implementations of
>> certain functions and to choose between them based on the hwcaps.
>> Typically, one set of functions is chosen at library initialisation
>> time.  Some libraries, such as libpixman, are implemented this way;
>> and it's often preferable, since the proportion of functions in a
>> library which get significant benefit from special instruction set
>> extensions is often pretty small.
>
> I've believed for a long time that we should try to encourage this
> approach.  The current approach (different libraries for each hardware
> configuration) is prevalent, both in the toolchain ("multilibs") and in
> other libraries -- but it seems to me premised on the idea that one is
> building everything from source for one's particular hardware.  In the
> earlier days of FOSS, the typical installation model was to download a
> source tarball, build it, and install it on your local machine.  In that
> context, tuning the library "just so" for your machine made sense.  But,
> to enable binary distribution, having to have N copies of a library (let
> alone an application) for N different ARM core variants just doesn't
> make sense to me.

Just so; and, as discussed before, improvements to package managers
could help here to avoid installing duplicate libraries.  (I believe
that rpm may have some capability here, but deb does not at
present.)

> So, I certainly think that things like STT_GNU_IFUNC (which enable
> determination of which routine to use at application start-up) make a
> lot of sense.
>
> I think your idea of exposing whether a unit is "ready", to allow even
> more fine-grained choices as an application runs, is clever.  I don't
> really know enough to say whether most applications could take advantage
> of that.  One of the problems I see is that you need global information,
> not local information.  In particular, if I'm using NEON to implement
> the inner loop of some performance-critical application, then when the
> unit is not ready, I want the kernel to wake it up already!  But, if I'm
> just using NEON to do some random computation off the critical path, I'm
> probably happy to do it slowly if that's more efficient than waking up
> the NEON unit.  But, which of these cases I'm in isn't always locally
> known at the point I'm doing the computation; the computation may be
> buried in a small library routine.

That's a fair concern -- I haven't explored the policy aspect much.
One possibility is that if the kernel sees system load nearing 100%,
it turns NEON on regardless.  But that's a pretty crude lever, and
might not bring a benefit if the software isn't able to use NEON.
Subtler approaches might involve the kernel collecting statistics on
applications' use of functional units, or some participation from
applications with realtime requirements.  Obviously, this is a bit
fuzzy for now...

>
> Do we have good examples of applications that could profit from this
> capability?

Currently, I don't have many examples-- the main one is related to the
discussions around using NEON for memcpy().  This can be a performance
win on some platforms, but except when the system is heavily loaded,
or when NEON happens to be turned on anyway, it may not be
advantageous for the user or overall system performance.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-06 11:07       ` Dave Martin
@ 2010-12-07  1:02         ` Mark Mitchell
  2010-12-07 10:45           ` Dave Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Mitchell @ 2010-12-07  1:02 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/6/2010 5:07 AM, Dave Martin wrote:

>> But,
>> to enable binary distribution, having to have N copies of a library (let
>> alone an application) for N different ARM core variants just doesn't
>> make sense to me.
> 
> Just so; and, as discussed before, improvements to package managers
> could help here to avoid installing duplicate libraries.  (I believe
> that rpm may have some capability here, but deb does not at
> present.)

Yes, a smarter package manager could help a device builder automatically
get the right version of a library.  But, something more fundamental has
to happen to avoid the library developer having to *produce* N versions
of a library.  (Yes, in theory, you just type "make" with different
CFLAGS options, but in practice of course it's often more complex than
that, especially if you need to validate the library.)

> Currently, I don't have many examples-- the main one is related to the
> discussions around using NEON for memcpy().  This can be a performance
> win on some platforms, but except when the system is heavily loaded,
> or when NEON happens to be turned on anyway, it may not be
> advantageous for the user or overall system performance.

How good of a proxy would the length of the copy be, do you think?  If
you want to copy 1G of data, and NEON makes you 2x-4x faster, then it
seems to me that you probably want to use NEON, almost independent of
overall system load.  But, if you're only going to copy 16 bytes, even
if NEON is faster, it's probably OK not to use it -- the function-call
overhead to get into memcpy at all is probably significant relative to
the time you'd save by using NEON.  In between, it's harder, of course
-- but perhaps if memcpy is the key example, we could get 80% of the
benefit of your idea simply by a test inside memcpy as to the length of
the data to be copied?
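
(A minimal sketch of that, where the threshold is a number I've just
made up and neon_copy() stands in for a real NEON loop:)

#include <stddef.h>
#include <string.h>

#define NEON_COPY_THRESHOLD 1024   /* invented; needs benchmarking */

/* Stand-in for a NEON-accelerated copy loop. */
static void *neon_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (n >= NEON_COPY_THRESHOLD)
        return neon_copy(dst, src, n);   /* worth waking the unit */
    return memcpy(dst, src, n);          /* small copy: stay on the CPU */
}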

-- 
Mark Mitchell
CodeSourcery
mark at codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07  1:02         ` Mark Mitchell
@ 2010-12-07 10:45           ` Dave Martin
  2010-12-07 11:04             ` Russell King - ARM Linux
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Martin @ 2010-12-07 10:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Tue, Dec 7, 2010 at 1:02 AM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/6/2010 5:07 AM, Dave Martin wrote:
>
>>> But,
>>> to enable binary distribution, having to have N copies of a library (let
>>> alone an application) for N different ARM core variants just doesn't
>>> make sense to me.
>>
>> Just so; and, as discussed before, improvements to package managers
>> could help here to avoid installing duplicate libraries.  (I believe
>> that rpm may have some capability here, but deb does not at
>> present.)
>
> Yes, a smarter package manager could help a device builder automatically
> get the right version of a library.  But, something more fundamental has
> to happen to avoid the library developer having to *produce* N versions
> of a library.  (Yes, in theory, you just type "make" with different
> CFLAGS options, but in practice of course it's often more complex than
> that, especially if you need to validate the library.)

Yes-- though I didn't elaborate on it.  You need a packager that can
understand, say, that a binary built for ARMv5 EABI can interoperate
with ARMv7 binaries etc.
Again, I've heard it suggested that RPM can handle this, but I haven't
looked at it in detail myself.

>
>> Currently, I don't have many examples-- the main one is related to the
>> discussions around using NEON for memcpy().  This can be a performance
>> win on some platforms, but except when the system is heavily loaded,
>> or when NEON happens to be turned on anyway, it may not be
>> advantageous for the user or overall system performance.
>
> How good of a proxy would the length of the copy be, do you think?  If
> you want to copy 1G of data, and NEON makes you 2x-4x faster, then it
> seems to me that you probably want to use NEON, almost independent of
> overall system load.  But, if you're only going to copy 16 bytes, even
> if NEON is faster, it's probably OK not to use it -- the function-call
> overhead to get into memcpy at all is probably significant relative to
> the time you'd save by using NEON.  In between, it's harder, of course
> -- but perhaps if memcpy is the key example, we could get 80% of the
> benefit of your idea simply by a test inside memcpy as to the length of
> the data to be copied?

For the memcpy() case, the answer is probably yes, though how often
memcpy is called by a given thread is also of significance.

However, there's still a problem: NEON is not designed for
implementing memcpy(), so there's no guarantee that it will always be
faster ... it is on some SoCs in some situations, but much less
beneficial on others -- the "sweet spots" both for performance and
power may differ widely from core to core and from SoC to SoC.  So
running benchmarks on one or two boards and then hard-compiling some
thresholds into glibc may not be the right approach.  Also, gcc
implements memcpy directly too for some cases (but only for small
copies?)

The dynamic hwcaps approach doesn't really solve that problem: for
adapting to different SoCs, you really want a way to run a benchmark
on the target to make your decision (xine-lib chooses an internal
memcpy implementation this way for example), or a way to pass some
platform metrics to glibc / other affected libraries.  Identifying the
precise SoC from /proc/cpuinfo isn't always straightforward, but I've
seen some code making use of it in similar ways.
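
(As a sketch of that benchmark-at-startup approach -- everything here,
from the buffer size to the iteration count, is invented and would need
tuning; neon_copy() again stands in for a real implementation:)

#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (64 * 1024)
#define ITERATIONS 256

typedef void *(*copy_fn)(void *, const void *, size_t);

static void *neon_copy(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);   /* stand-in for a NEON copy loop */
}

static uint64_t time_copy(copy_fn fn, void *d, void *s)
{
    struct timespec t0, t1;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERATIONS; i++)
        fn(d, s, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000ull
           + (t1.tv_nsec - t0.tv_nsec);
}

copy_fn chosen_copy = memcpy;

/* Call once at library init: pick whichever is faster on this SoC. */
void select_copy_impl(void)
{
    static char src[BUF_SIZE], dst[BUF_SIZE];

    if (time_copy(neon_copy, dst, src) < time_copy(memcpy, dst, src))
        chosen_copy = neon_copy;
}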

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07 10:45           ` Dave Martin
@ 2010-12-07 11:04             ` Russell King - ARM Linux
  2010-12-07 15:06               ` Dave Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2010-12-07 11:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 07, 2010 at 10:45:42AM +0000, Dave Martin wrote:
> Yes-- though I didn't elaborate on it.  You need a packager that can
> understand, say, that a binary built for ARMv5 EABI can interoperate
> with ARMv7 binaries etc.
> Again, I've heard it suggested that RPM can handle this, but I haven't
> looked at it in detail myself.

That is indeed the case - as on x86, it used to be common to build the
majority of the distribution for i386, and glibc and a few other bits
for a range of ix86 CPUs.

rpm and yum know that i386 is compatible with i486, which is compatible
with i586 etc, so it will install an i386 package on i686 if no i486,
i586 or i686 package is available.

It does the same for ARM with ARMv3, ARMv4 etc.

> The dynamic hwcaps approach doesn't really solve that problem:

Has anyone investigated whether it is possible to power down things like
Neon etc while leaving the rest of the CPU running?  I've not seen
anything in the ARM documentation to suggest that's the case.

Even in MPCore based systems, the interface between the SCU and individual
processors by default doesn't have the necessary clamps built in to allow
individual CPUs to be powered off, and I'm not aware of any designs which
decided to enable this feature (as there's a performance penalty).  So I'd
be really surprised if there was any support for powering down Neon
separately from the host CPU.

If that's the case, it's entirely pointless discussing what userspace can
or can't do - if you have Neon available and can't power it down, and it's
faster for doing something, you might as well use it so you can put the
main CPU into WFI mode or get on with some other useful work.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07 11:04             ` Russell King - ARM Linux
@ 2010-12-07 15:06               ` Dave Martin
  2010-12-07 15:21                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Martin @ 2010-12-07 15:06 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Tue, Dec 7, 2010 at 11:04 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Tue, Dec 07, 2010 at 10:45:42AM +0000, Dave Martin wrote:
>> Yes-- though I didn't elaborate on it.  You need a packager that can
>> understand, say, that a binary built for ARMv5 EABI can interoperate
>> with ARMv7 binaries etc.
>> Again, I've heard it suggested that RPM can handle this, but I haven't
>> looked at it in detail myself.
>
> That is indeed the case - as on x86, it used to be common to build the
> majority of the distribution for i386, and glibc and a few other bits
> for a range of ix86 CPUs.
>
> rpm and yum know that i386 is compatible with i486, which is compatible
> with i586 etc, so it will install an i386 package on i686 if no i486,
> i586 or i686 package is available.
>
> It does the same for ARM with ARMv3, ARMv4 etc.

That sounds plausible.  If you really want to go to town on this it
gets more complicated, but there's still a lot of value in modelling
the architectural development as a linear progression in this way.

>
>> The dynamic hwcaps approach doesn't really solve that problem:
>
> Has anyone investigated whether it is possible to power down things like
> Neon etc while leaving the rest of the CPU running?  I've not seen
> anything in the ARM documentation to suggest that's the case.
>
> Even in MPCore based systems, the interface between the SCU and individual
> processors by default doesn't have the necessary clamps built in to allow
> individual CPUs to be powered off, and I'm not aware of any designs which
> decided to enable this feature (as there's a performance penalty).  So I'd
> be really surprised if there was any support for powering down Neon
> separately from the host CPU.

It's not part of the architecture per se, but some SoCs do put NEON in
a separate power domain and can power-manage it somewhat independently.

However, I guess we need to clarify exactly how this works for SoCs in
practice.  If NEON and VFP are power-managed co-dependently (for
example, though not so likely(?)) this is not so useful to us ... since
it's becoming common to build everything with -mfpu=vfp*.

Because the kernel only uses FPEXC.EN to en/disable these extensions,
NEON and VFP are somewhat tied together ... though we might be able to
be more flexible by toggling the CPACR.ASEDIS control bit instead.  I
don't believe the kernel currently touches this (?)
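
(For the record, a sketch of what poking that bit might look like --
privileged code only, bit position from the ARMv7-A ARM; this is
illustrative, not existing kernel code:)

/* CPACR is cp15, c1, c0, 2; ASEDIS (bit 31) undef-traps Advanced SIMD
 * instructions while leaving VFP data-processing usable. */
static inline void set_asedis(int disable)
{
    unsigned int cpacr;

    asm volatile("mrc p15, 0, %0, c1, c0, 2" : "=r" (cpacr));
    if (disable)
        cpacr |= 1u << 31;
    else
        cpacr &= ~(1u << 31);
    asm volatile("mcr p15, 0, %0, c1, c0, 2" : : "r" (cpacr));
    asm volatile("isb");
}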

>
> If that's the case, it's entirely pointless discussing what userspace can
> or can't do - if you have Neon available and can't power it down, and it's
> faster for doing something, you might as well use it so you can put the
> main CPU into WFI mode or get on with some other useful work.
>

Indeed ... my layman's understanding is that it is worth it on some
platforms, but I guess I need to clarify this with someone who
understands the hardware.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07 15:06               ` Dave Martin
@ 2010-12-07 15:21                 ` Russell King - ARM Linux
  2010-12-07 15:36                   ` Dave Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2010-12-07 15:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 07, 2010 at 03:06:51PM +0000, Dave Martin wrote:
> Hi,
> 
> On Tue, Dec 7, 2010 at 11:04 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Tue, Dec 07, 2010 at 10:45:42AM +0000, Dave Martin wrote:
> >> Yes-- though I didn't elaborate on it.  You need a packager that can
> >> understand, say, that a binary built for ARMv5 EABI can interoperate
> >> with ARMv7 binaries etc.
> >> Again, I've heard it suggested that RPM can handle this, but I haven't
> >> looked at it in detail myself.
> >
> > That is indeed the case - as on x86, it used to be common to build the
> > majority of the distribution for i386, and glibc and a few other bits
> > for a range of ix86 CPUs.
> >
> > rpm and yum know that i386 is compatible with i486, which is compatible
> > with i586 etc, so it will install an i386 package on i686 if no i486,
> > i586 or i686 package is available.
> >
> > It does the same for ARM with ARMv3, ARMv4 etc.
> 
> That sounds plausible.

That sounds like doubt.

I've used rpm extensively over the last 10 years or so, both on x86 and
ARM.  I've built many versions of Red Hat and Fedora for ARM.  My ARM
machines here (including the one which is going to send this email) run
the result of that - and is currently a mixture of ARMv3 and ARMv4
Fedora packages.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07 15:21                 ` Russell King - ARM Linux
@ 2010-12-07 15:36                   ` Dave Martin
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Martin @ 2010-12-07 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 7, 2010 at 3:21 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Tue, Dec 07, 2010 at 03:06:51PM +0000, Dave Martin wrote:
>> Hi,
>>
>> On Tue, Dec 7, 2010 at 11:04 AM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > On Tue, Dec 07, 2010 at 10:45:42AM +0000, Dave Martin wrote:
>> >> Yes-- though I didn't elaborate on it.  You need a packager that can
>> >> understand, say, that a binary built for ARMv5 EABI can interoperate
>> >> with ARMv7 binaries etc.
>> >> Again, I've heard it suggested that RPM can handle this, but I haven't
>> >> looked at it in detail myself.
>> >
>> > That is indeed the case - as on x86, it used to be common to build the
>> > majority of the distribution for i386, and glibc and a few other bits
>> > for a range of ix86 CPUs.
>> >
>> > rpm and yum know that i386 is compatible with i486, which is compatible
>> > with i586 etc, so it will install an i386 package on i686 if no i486,
>> > i586 or i686 package is available.
>> >
>> > It does the same for ARM with ARMv3, ARMv4 etc.
>>
>> That sounds plausible.
>
> That sounds like doubt.
>
> I've used rpm extensively over the last 10 years or so, both on x86 and
> ARM.  I've built many versions of Red Hat and Fedora for ARM.  My ARM
> machines here (including the one which is going to send this email) run
> the result of that - currently a mixture of ARMv3 and ARMv4
> Fedora packages.

Only doubt in the sense that I don't have experience with it myself,
but I'm happy to take your word on it since you're more familiar with
rpm.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-03 16:28 RFC: Dynamic hwcaps Dave Martin
                   ` (2 preceding siblings ...)
  2010-12-05 14:12 ` Thomas Petazzoni
@ 2010-12-07 21:15 ` Ben Dooks
  2010-12-08 10:30   ` Dave Martin
  3 siblings, 1 reply; 18+ messages in thread
From: Ben Dooks @ 2010-12-07 21:15 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/12/10 16:28, Dave Martin wrote:
> Hi all,
> 
> I'd be interested in people's views on the following idea-- feel free
> to ignore if it doesn't interest you.
> 
> 
> For power-management purposes, it's useful to be able to turn off
> functional blocks on the SoC.
> 
> For on-SoC peripherals, this can be managed through the driver
> framework in the kernel, but for functional blocks of the CPU itself
> which are used by instruction set extensions, such as NEON or other
> media accelerators, it would be interesting if processes could adapt
> to these units appearing and disappearing at runtime.  This would mean
> that user processes would need to select dynamically between different
> implementations of accelerated functionality at runtime.
> 
> This allows for more active power management of such functional
> blocks: if the CPU is not fully loaded, you can turn them off -- the
> kernel can spot when there is significant idle time and do this.  If
> the CPU becomes fully loaded, applications which have soft-realtime
> constraints can notice this and switch to their accelerated code
> (which will cause the kernel to switch the functional unit(s) on).
> Or, the kernel can react to increasing CPU load by speculatively turning
> it on instead.  This is analogous to the behaviour of other power
> governors in the system.  Non-aware applications will still work
> seamlessly -- these may simply run accelerated code if the hardware
> supports it, causing the kernel to turn the affected functional
> block(s) on.
> 
> In order for this to work, some dynamic status information would need
> to be visible to each user process, and polled each time a function
> with a dynamically switchable choice of implementations gets called.
> You probably don't need to worry about race conditions either-- if the
> process accidentally tries to use a turned-off feature, you will take
> a fault which gives the kernel the chance to turn the feature back on.

Could you do what the original FP did, and start with units off and use
the first use of $unit in the process to turn it on? Do things like NEON
support this?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-07 21:15 ` Ben Dooks
@ 2010-12-08 10:30   ` Dave Martin
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Martin @ 2010-12-08 10:30 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Tue, Dec 7, 2010 at 9:15 PM, Ben Dooks <ben-linux@fluff.org> wrote:

[...]

>
> Could you do what the original FP did, and start with units off and use
> the first use of $unit in the process to turn it on? Do things like NEON
> support this?
>

Actually, this is still done -- it's the same code since NEON and VFP
use a common register file.  The issue under discussion is that
userspace can't detect whether units like these are active or not, and
so can't make dynamic runtime decisions about whether to run
accelerated code or not.  (And also, whether this would actually be
useful.)

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-06 10:56   ` Dave Martin
@ 2010-12-08 11:01     ` Jamie Lokier
  2010-12-08 11:07       ` Dave Martin
  0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2010-12-08 11:01 UTC (permalink / raw)
  To: linux-arm-kernel

Dave Martin wrote:
> On Sun, Dec 5, 2010 at 2:12 PM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
> > Hi,
> >
> > On Fri, 3 Dec 2010 16:28:27 +0000
> > Dave Martin <dave.martin@linaro.org> wrote:
> >
> >> This allows for more active power management of such functional
> >> blocks: if the CPU is not fully loaded, you can turn them off -- the
> >> kernel can spot when there is significant idle time and do this.  If
> >> the CPU becomes fully loaded, applications which have soft-realtime
> >> constraints can notice this and switch to their accelerated code
> >> (which will cause the kernel to switch the functional unit(s) on).
> >> Or, the kernel can react to increasing CPU load by speculatively turning
> >> it on instead.  This is analogous to the behaviour of other power
> >> governors in the system.  Non-aware applications will still work
> >> seamlessly -- these may simply run accelerated code if the hardware
> >> supports it, causing the kernel to turn the affected functional
> >> block(s) on.
> >
> > From a power management perspective, is it really useful to load the
> > CPU instead of using specialized units which usually provide more
> > computing power per watt consumed?
> 
> No--- but you can't in general just exchange cycles on one functional
> unit for cycles on another.
> 
> Suppose 90% of your code (by execution time) can take advantage of a
> specialised functional unit.  Should you turn that unit on?
> 
> Now, suppose only 5% of the code can take advantage, but the platform
> is not completely busy.  Turning on a special functional unit consumes
> extra power and will provide no speedup to the user -- is it still
> worth turning it on?  What if the CPU is fully loaded doing other work
> and your program is close to missing its realtime deadlines -- should
> you turn on the separate unit now?

I think Thomas's point is that doing the 5% on the CPU may consume
more power than turning on the special functional unit - even when
the system is not busy and the user doesn't see a time difference.

I don't know if that's true for available hardware, but it seems like
it's worth investigating before taking the idea further.

-- Jamie

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RFC: Dynamic hwcaps
  2010-12-08 11:01     ` Jamie Lokier
@ 2010-12-08 11:07       ` Dave Martin
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Martin @ 2010-12-08 11:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 8, 2010 at 11:01 AM, Jamie Lokier <jamie@shareable.org> wrote:
> Dave Martin wrote:
>> On Sun, Dec 5, 2010 at 2:12 PM, Thomas Petazzoni
>> <thomas.petazzoni@free-electrons.com> wrote:
>> > Hi,
>> >
>> > On Fri, 3 Dec 2010 16:28:27 +0000
>> > Dave Martin <dave.martin@linaro.org> wrote:
>> >
>> >> This allows for more active power management of such functional
>> >> blocks: if the CPU is not fully loaded, you can turn them off -- the
>> >> kernel can spot when there is significant idle time and do this.  If
>> >> the CPU becomes fully loaded, applications which have soft-realtime
>> >> constraints can notice this and switch to their accelerated code
>> >> (which will cause the kernel to switch the functional unit(s) on).
>> >> Or, the kernel can react to increasing CPU load by speculatively turning
>> >> it on instead.  This is analogous to the behaviour of other power
>> >> governors in the system.  Non-aware applications will still work
>> >> seamlessly -- these may simply run accelerated code if the hardware
>> >> supports it, causing the kernel to turn the affected functional
>> >> block(s) on.
>> >
>> > From a power management perspective, is it really useful to load the
>> > CPU instead of using specialized units which usually provide more
>> > computing power per watt consumed?
>>
>> No--- but you can't in general just exchange cycles on one functional
>> unit for cycles on another.
>>
>> Suppose 90% of your code (by execution time) can take advantage of a
>> specialised functional unit.  Should you turn that unit on?
>>
>> Now, suppose only 5% of the code can take advantage, but the platform
>> is not completely busy.  Turning on a special functional unit consumes
>> extra power and will provide no speedup to the user -- is it still
>> worth turning it on?  What if the CPU is fully loaded doing other work
>> and your program is close to missing its realtime deadlines -- should
>> you turn on the separate unit now?
>
> I think Thomas's point is that doing the 5% on the CPU may consume
> more power than turning on the special functional unit - even when
> the system is not busy and the user doesn't see a time difference.
>
> I don't know if that's true for available hardware, but it seems like
> it's worth investigating before taking the idea further.

Agreed -- either could be the case.  It's something you can never be
certain about without doing some measurements...

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread

Thread overview: 18+ messages
2010-12-03 16:28 RFC: Dynamic hwcaps Dave Martin
2010-12-03 16:43 ` Jesse Barker
2010-12-03 16:51 ` Russell King - ARM Linux
2010-12-03 17:35   ` Dave Martin
2010-12-05 15:14     ` Mark Mitchell
2010-12-06 11:07       ` Dave Martin
2010-12-07  1:02         ` Mark Mitchell
2010-12-07 10:45           ` Dave Martin
2010-12-07 11:04             ` Russell King - ARM Linux
2010-12-07 15:06               ` Dave Martin
2010-12-07 15:21                 ` Russell King - ARM Linux
2010-12-07 15:36                   ` Dave Martin
2010-12-05 14:12 ` Thomas Petazzoni
2010-12-06 10:56   ` Dave Martin
2010-12-08 11:01     ` Jamie Lokier
2010-12-08 11:07       ` Dave Martin
2010-12-07 21:15 ` Ben Dooks
2010-12-08 10:30   ` Dave Martin
