linuxppc-dev.lists.ozlabs.org archive mirror
* floating point support in the driver.
@ 2008-08-01 10:57 Misbah khan
  2008-08-01 11:32 ` Laurent Pinchart
  0 siblings, 1 reply; 9+ messages in thread
From: Misbah khan @ 2008-08-01 10:57 UTC (permalink / raw)
  To: linuxppc-embedded


Hi all,

I have a DSP algorithm which I am running in a user application; even after
enabling VFP support it is taking a lot of time to execute.

I want to move the same code into a driver instead of a user application.
Can anybody suggest whether doing so would be a better solution, and what
challenges I would face in implementing such floating-point support in the
driver?

Is there a way to make it execute faster in the application itself?

---- Misbah <>< 
-- 
View this message in context: http://www.nabble.com/floating-point-support-in-the-driver.-tp18772109p18772109.html
Sent from the linuxppc-embedded mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-01 10:57 floating point support in the driver Misbah khan
@ 2008-08-01 11:32 ` Laurent Pinchart
  2008-08-01 12:00   ` Misbah khan
  0 siblings, 1 reply; 9+ messages in thread
From: Laurent Pinchart @ 2008-08-01 11:32 UTC (permalink / raw)
  To: linuxppc-embedded; +Cc: Misbah khan


On Friday 01 August 2008, Misbah khan wrote:
> 
> Hi all,
> 
> I have a DSP algorithm which I am running in a user application; even after
> enabling VFP support it is taking a lot of time to execute.
> 
> I want to move the same code into a driver instead of a user application.
> Can anybody suggest whether doing so would be a better solution, and what
> challenges I would face in implementing such floating-point support in the
> driver?
> 
> Is there a way to make it execute faster in the application itself?

Floating-point in the kernel should be avoided. FPU state save/restore operations are costly and are not performed by the kernel when switching from userspace to kernelspace context. You will have to protect floating-point sections with kernel_fpu_begin/kernel_fpu_end which, if I'm not mistaken, disables preemption. That's probably not something you want to do. Why would the same code run faster in kernelspace than userspace?
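For illustration only, the guarded pattern referred to above looks roughly like this. This is a sketch, not buildable as-is: kernel_fpu_begin()/kernel_fpu_end() are the x86 names (their header has moved between kernel versions), and PowerPC has its own mechanism (e.g. enable_kernel_fp()); the details vary by architecture.

```c
/* Sketch of kernel-side floating point -- illustration only.
 * kernel_fpu_begin() saves the FPU state and disables preemption,
 * so the protected region must be short and must never sleep. */

static void scale_samples(float *data, int n)
{
        int i;

        kernel_fpu_begin();             /* preemption disabled from here */
        for (i = 0; i < n; i++)
                data[i] *= 0.5f;
        kernel_fpu_end();               /* state restored, preemption back on */
}
```

Note that nothing in this pattern makes the arithmetic faster; it only makes FPU use safe in kernel context, at the cost of the save/restore and the preemption-off window.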

-- 
Laurent Pinchart
CSE Semaphore Belgium

Chaussee de Bruxelles, 732A
B-1410 Waterloo
Belgium

T +32 (2) 387 42 59
F +32 (2) 387 42 75


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-01 11:32 ` Laurent Pinchart
@ 2008-08-01 12:00   ` Misbah khan
  2008-08-01 15:54     ` M. Warner Losh
  0 siblings, 1 reply; 9+ messages in thread
From: Misbah khan @ 2008-08-01 12:00 UTC (permalink / raw)
  To: linuxppc-embedded


I am not very clear on why floating-point support in the kernel should be
avoided.

We want our DSP algorithm to run at boot time, and since a kernel thread
has higher priority, I assumed it would be faster than a user
application.

If I really have to speed up my application's execution, what mechanism
would you suggest I try?

Even after using hardware VFP support I am still missing my timing
requirement by 800 ms.

---- Misbah <><



-- 
View this message in context: http://www.nabble.com/floating-point-support-in-the-driver.-tp18772109p18772952.html
Sent from the linuxppc-embedded mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-01 12:00   ` Misbah khan
@ 2008-08-01 15:54     ` M. Warner Losh
  2008-08-04  5:23       ` Misbah khan
  0 siblings, 1 reply; 9+ messages in thread
From: M. Warner Losh @ 2008-08-01 15:54 UTC (permalink / raw)
  To: misbah_khan; +Cc: linuxppc-embedded

In message: <18772952.post@talk.nabble.com>
            Misbah khan <misbah_khan@engineer.com> writes:
: I am not very clear on why floating-point support in the kernel should be
: avoided.

Because saving the FPU state is expensive.  The kernel multiplexes the
FPU hardware among all the userland processes that use it.  For parts
of the kernel to effectively use the FPU, it would have to save the
state on traps into the kernel, and restore the state when returning
to userland.  This is a big drag on the performance of the system.  There
are ways around this optimization where you save the FPU state
explicitly, but the expense is still there.

: We want our DSP algorithm to run at boot time, and since a kernel thread
: has higher priority, I assumed it would be faster than a user
: application.

Bad assumption.  User threads can get boosts in priority in certain
cases.

If it really is just at boot time, before any other threads are
started, you likely can get away with it.

: If I really have to speed up my application's execution, what mechanism
: would you suggest I try?
: 
: Even after using hardware VFP support I am still missing my timing
: requirement by 800 ms.

This sounds like a classic case of putting 20 pounds in a 10 pound bag
and complaining that the bag rips out.  You need a bigger bag.

If you are doing FPU intensive operations in userland, moving them to
the kernel isn't going to help anything but maybe latency.  And if you
are almost a full second short, your quest to move things into the
kernel is almost certainly not going to help enough.  Moving things
into the kernel only helps latency, and only when there's lots of
context switches (since doing stuff in the kernel avoids the domain
crossing that forces the save of the CPU state).

I don't know if the 800ms timing is relative to a task that must run
once a second, or once an hour.  If the former, you're totally
screwed and need to either be more clever about your algorithm
(consider integer math, profiling the hot spots, etc.), or you need
more powerful silicon.  If you are trying to shave 800ms off a task
that runs for an hour, then you just might be able to do that with
tiny code tweaks.

Sorry to be so harsh, but really, there's no such thing as a free lunch.

Warner




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-01 15:54     ` M. Warner Losh
@ 2008-08-04  5:23       ` Misbah khan
  2008-08-04  5:33         ` M. Warner Losh
  0 siblings, 1 reply; 9+ messages in thread
From: Misbah khan @ 2008-08-04  5:23 UTC (permalink / raw)
  To: linuxppc-embedded


Thank you Warner.

Actually, the complete algorithm should take no more than 1 second to
execute, but it is taking around 1.8 seconds.  The algorithm runs every few
seconds.  I am trying to fine-tune the code; I just want to know whether it
would be a good idea to alter the task priority, and what the best way is.

-- Misbah <>< 


-- 
View this message in context: http://www.nabble.com/floating-point-support-in-the-driver.-tp18772109p18805820.html
Sent from the linuxppc-embedded mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-04  5:23       ` Misbah khan
@ 2008-08-04  5:33         ` M. Warner Losh
  2008-08-04  5:47           ` David Hawkins
  0 siblings, 1 reply; 9+ messages in thread
From: M. Warner Losh @ 2008-08-04  5:33 UTC (permalink / raw)
  To: misbah_khan; +Cc: linuxppc-embedded

In message: <18805820.post@talk.nabble.com>
            Misbah khan <misbah_khan@engineer.com> writes:
: Actually, the complete algorithm should take no more than 1 second to
: execute, but it is taking around 1.8 seconds.  The algorithm runs every
: few seconds.  I am trying to fine-tune the code; I just want to know
: whether it would be a good idea to alter the task priority, and what the
: best way is.

You could try a very high priority task, but I'd suggest that
profiling the code to see where the hot spots are might yield better
results...  Have you identified what other process is running for
those two seconds that's causing your <1s algorithm to take about 2x
as long?  What does the real vs. CPU time say for this algorithm?  If they
are about the same, then you have to make it faster.

Given that you are looking for a factor of 2x, my experience suggests
that moving this into the kernel is unlikely to be successful and will
be a lot of pain.  It would be rare indeed to find a system where
context switches really account for that much of the time.

To make progress, you need to identify the real root cause for this
slowdown.  Either your thread really is taking the extra time, in
which case profiling and algorithm improvement is your only
alternative.  Or someone else is eating all the CPU, and you must
either hold them off, or get a beefier CPU.

Boosting the priority might be a good diagnostic aid, but may have
unintended side effects if you really are competing with something
else.  Wouldn't that starve the other process?  What is it doing?

Warner



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-04  5:33         ` M. Warner Losh
@ 2008-08-04  5:47           ` David Hawkins
  2008-08-05  9:49             ` Misbah khan
  0 siblings, 1 reply; 9+ messages in thread
From: David Hawkins @ 2008-08-04  5:47 UTC (permalink / raw)
  To: M. Warner Losh; +Cc: misbah_khan, linuxppc-embedded


Hi Misbah,

I would recommend you look at your floating-point code again
and benchmark each section. You should be able to estimate
the number of clock cycles required to complete an operation
and then check that against your measurements.

Depending on whether your algorithm is processing-intensive
or data-movement-intensive, you may find that the big time
waster is moving data on or off chip, or perhaps it's a large
vector operation that is blowing out the cache. If you
do find that, then on some processors you can lock the
cache, so your algorithm would require a custom driver
that steals part of the cache from the OS; the floating-point
code would not run in the kernel, it would run on data
stored in the stolen cache area. You can lock both instructions
and data in the cache; e.g. an FFT routine can be locked in
the instruction cache, while FFT data is in the data cache.
I'm not sure how easy this is to do under Linux though.

Here's an example of the level of detail you can get
down to when benchmarking code:

http://www.ovro.caltech.edu/~dwh/correlator/pdf/dsp_programming.pdf

The FFT routine used on this processor made use of both
the instruction and data cache (on-chip SRAM) on the
DSP.

This code is being re-developed to run on a MPC8349EA PowerPC
with FPU. I did some initial testing to confirm that the
FPU operates as per the data sheet, and will eventually get
around to more complete testing.

Which processor were you running your code on, and what
frequency were you operating the processor at? How does
the algorithm timing compare when run on other processors,
e.g. your desktop or laptop machine?

Cheers,
Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-04  5:47           ` David Hawkins
@ 2008-08-05  9:49             ` Misbah khan
  2008-08-05 16:53               ` David Hawkins
  0 siblings, 1 reply; 9+ messages in thread
From: Misbah khan @ 2008-08-05  9:49 UTC (permalink / raw)
  To: linuxppc-embedded


Hi David ,

Thank you for your reply.

I am running the algorithm on an OMAP processor (ARM core), and I tried the
same on an i.MX processor, which takes 1.7 times longer than the OMAP.

It is true that the algorithm is performing a vector operation which is
blowing out the cache.

But the question is: how do we lock the cache, and how should we
implement that in a driver?

An example code or a document would be helpful in this regard.

--- Misbah <><


-- 
View this message in context: http://www.nabble.com/floating-point-support-in-the-driver.-tp18772109p18827857.html
Sent from the linuxppc-embedded mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: floating point support in the driver.
  2008-08-05  9:49             ` Misbah khan
@ 2008-08-05 16:53               ` David Hawkins
  0 siblings, 0 replies; 9+ messages in thread
From: David Hawkins @ 2008-08-05 16:53 UTC (permalink / raw)
  To: Misbah khan; +Cc: linuxppc-embedded

Hi Misbah,

> I am running the algorithm on an OMAP processor (ARM core), and I
> tried the same on an i.MX processor, which takes 1.7 times longer
> than the OMAP.

OK, that's a 10,000 ft benchmark. The observation being
that it fails your requirement.

How does that time compare to the operations
required, and their expected times?

> It is true that the algorithm is performing a vector
> operation which is blowing out the cache.

Determined how? Obviously if your cache is 16K and your
data is 64K, there's no way it'll fit in there at once,
but the algorithm could be crafted such that 1K at a time
was processed, while another data packet was moved into
the cache ... but this is very processor-specific.

> But the question is: how do we lock the cache, and how
> should we implement that in a driver?
> 
> An example code or a document could be helpful in this regard.

Indeed :)

I have no idea how the OMAP works, so the following are
just random, and possibly incorrect ramblings ...

The MPC8349EA startup code uses a trick where it zeros
out sections of the cache while providing an address.
Once the addresses and zeros are in the cache, it's locked.
From that point on, memory accesses to those addresses
result in cache 'hits'. This is the startup stack used
by the U-Boot bootloader.

If something similar was done under Linux, then *I guess*
you could implement mmap() and ioremap() the section of
addresses associated with the locked cache lines.
You could then DMA data to and from the cache area,
and run your algorithm there. That would provide you
with 'fast SRAM'.

However, you might be able to get the same effect by
setting up your processing algorithm such that it handled
smaller chunks of data.

Feel free to explain your data processing :)

Cheers,
Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-08-05 17:10 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2008-08-01 10:57 floating point support in the driver Misbah khan
2008-08-01 11:32 ` Laurent Pinchart
2008-08-01 12:00   ` Misbah khan
2008-08-01 15:54     ` M. Warner Losh
2008-08-04  5:23       ` Misbah khan
2008-08-04  5:33         ` M. Warner Losh
2008-08-04  5:47           ` David Hawkins
2008-08-05  9:49             ` Misbah khan
2008-08-05 16:53               ` David Hawkins
