* udelay() broken for SMP cores?
@ 2010-04-21 2:19 Saravana Kannan
2010-04-21 4:56 ` Shilimkar, Santosh
0 siblings, 1 reply; 19+ messages in thread
From: Saravana Kannan @ 2010-04-21 2:19 UTC (permalink / raw)
To: linux-arm-kernel
Fixed the subject and starting a new thread. Please ignore the previous
email with an incorrect subject.
Hi,
This is the first time I'm mailing LAKML, so please excuse me if I do
something silly (looks like I already did!).
I looked at arch/arm/lib/delay.S and it looks like __udelay and
__const_udelay won't work correctly for SMP cores. The code just uses
the loops_per_jiffy variable instead of the per-CPU loops per jiffy data.
Is anyone already working on a fix for that? If not, I can fix it up in
a way that's hopefully palatable to the community.
Thanks,
Saravana
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 2:19 udelay() broken for SMP cores? Saravana Kannan
@ 2010-04-21 4:56 ` Shilimkar, Santosh
2010-04-21 6:43 ` Saravana Kannan
0 siblings, 1 reply; 19+ messages in thread
From: Shilimkar, Santosh @ 2010-04-21 4:56 UTC (permalink / raw)
To: linux-arm-kernel
> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel-
> bounces at lists.infradead.org] On Behalf Of Saravana Kannan
> Sent: Wednesday, April 21, 2010 7:50 AM
> To: linux-arm-kernel at lists.infradead.org
> Subject: udelay() broken for SMP cores?
>
> Fixed the subject and starting a new thread. Please ignore the previous
> email with an incorrect subject.
>
> Hi,
>
> This is the first time I'm mailing LAKML, so please excuse me if I do
> something silly (looks like I already did!).
>
> I looked at arch/arm/lib/delay.S and it looks like __udelay and
> __const_udelay won't work correctly for SMP cores. The code just uses
> the loops_per_jiffy variable instead of the per-CPU loops per jiffy data.
>
> Is anyone already working on a fix for that? If not, I can fix it up in
> a way that's hopefully palatable to the community.
Two questions :
1. Why do you think the udelay is broken ?? Did you measure the time ??
2. Have you arrived at this conclusion based on BOGO MIPS value ??
If you are correlating BIGOMIPS with udelay, then I am sure you are going
Get it wrong because of the aggressive nature of newer ARM's where the branch
loop in the udelay is always hit and instead of fetching it from cache/memory,
its taken from pipe line itself.
Ofcourse you point of using the per cpu "loops_per_jiffy" variable may be valid, but
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 4:56 ` Shilimkar, Santosh
@ 2010-04-21 6:43 ` Saravana Kannan
2010-04-21 7:22 ` Russell King - ARM Linux
0 siblings, 1 reply; 19+ messages in thread
From: Saravana Kannan @ 2010-04-21 6:43 UTC (permalink / raw)
To: linux-arm-kernel
>> I looked at arch/arm/lib/delay.S and it looks like __udelay and
>> __const_udelay won't work correctly for SMP cores. The code just uses
>> the loops_per_jiffy variable instead of the per-CPU loops per jiffy data.
>>
>> Is anyone already working on a fix for that? If not, I can fix it up in
>> a way that's hopefully palatable to the community.
>
> If you have case where individual CPUs are running at different speed and different
> Tick rate then only this can make difference.
Yes. I was talking about the case where the CPUs could be running at
different speeds.
Thanks,
Saravana
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 6:43 ` Saravana Kannan
@ 2010-04-21 7:22 ` Russell King - ARM Linux
2010-04-21 9:39 ` skannan at codeaurora.org
0 siblings, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 7:22 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Apr 20, 2010 at 11:43:23PM -0700, Saravana Kannan wrote:
>
>>> I looked at arch/arm/lib/delay.S and it looks like __udelay and
>>> __const_udelay won't work correctly for SMP cores. The code just uses
>>> the loops_per_jiffy variable instead of the per-CPU loops per jiffy data.
>>>
>>> Is anyone already working on a fix for that? If not, I can fix it up in
>>> a way that's hopefully palatable to the community.
>>
>
>> If you have case where individual CPUs are running at different speed and different
>> Tick rate then only this can make difference.
>
> Yes. I was talking about the case where the CPUs could be running at
> different speeds.
We don't support that; if we did, we'd have to disable preempt for every
call to mdelay/udelay to ensure that the thread is locked to a particular
CPU. I suspect that will (a) destroy RT scheduling preformance (b)
increase preempt latency to an undesirable extent.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 7:22 ` Russell King - ARM Linux
@ 2010-04-21 9:39 ` skannan at codeaurora.org
2010-04-21 9:50 ` Russell King - ARM Linux
0 siblings, 1 reply; 19+ messages in thread
From: skannan at codeaurora.org @ 2010-04-21 9:39 UTC (permalink / raw)
To: linux-arm-kernel
> On Tue, Apr 20, 2010 at 11:43:23PM -0700, Saravana Kannan wrote:
>>
>>>> I looked at arch/arm/lib/delay.S and it looks like __udelay and
>>>> __const_udelay won't work correctly for SMP cores. The code just uses
>>>> the loops_per_jiffy variable instead of the per-CPU loops per jiffy
>>>> data.
>>>>
>>>> Is anyone already working on a fix for that? If not, I can fix it up
>>>> in
>>>> a way that's hopefully palatable to the community.
>>>
>>
>>> If you have case where individual CPUs are running at different speed
>>> and different
>>> Tick rate then only this can make difference.
>>
>> Yes. I was talking about the case where the CPUs could be running at
>> different speeds.
>
> We don't support that; if we did, we'd have to disable preempt for every
> call to mdelay/udelay to ensure that the thread is locked to a particular
> CPU. I suspect that will (a) destroy RT scheduling preformance (b)
> increase preempt latency to an undesirable extent.
>
Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling
and x86 udelay uses per-CPU data. So your concern should apply for x86
too. I had the same concern and was planning on bring it up in the cpufreq
mailing list after I made sure I didn't misunderstand anything.
Btw, your concern should apply for single core scaling too, right? Context
switch can complete within max udelay (general - 5ms, ARM - 2ms) time and
CPU could have jumped
from lowest to highest speed in that time and mess up udelay. I didn't see
any code in cpufreq that deferred scaling during udelay. So, that's something
I plan to ask cpufreq folks too.
Let me know what you think.
Thanks,
Saravana
P.S: Sent from phone. Pls excuse formatting issues.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 9:39 ` skannan at codeaurora.org
@ 2010-04-21 9:50 ` Russell King - ARM Linux
2010-04-21 9:58 ` Gilles Chanteperdrix
` (3 more replies)
0 siblings, 4 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 9:50 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 21, 2010 at 02:39:39AM -0700, skannan at codeaurora.org wrote:
> Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling
> and x86 udelay uses per-CPU data. So your concern should apply for x86
> too. I had the same concern and was planning on bring it up in the cpufreq
> mailing list after I made sure I didn't misunderstand anything.
Well, x86 looks buggy in this regard as well - the loops_per_jiffy
value used is for a CPU which may not run the delay loop.
> Btw, your concern should apply for single core scaling too, right? Context
> switch can complete within max udelay (general - 5ms, ARM - 2ms) time and
> CPU could have jumped
> from lowest to highest speed in that time and mess up udelay. I didn't see
> any code in cpufreq that deferred scaling during udelay. So, that's something
> I plan to ask cpufreq folks too.
Well, the assumption is that the CPUs will be running at their fastest
speed at boot time, and therefore loops_per_jiffy will be calibrated
such that we guarantee _at least_ the asked-for delay - which is the
only guarantee udelay has.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 9:50 ` Russell King - ARM Linux
@ 2010-04-21 9:58 ` Gilles Chanteperdrix
2010-04-21 10:00 ` Jamie Lokier
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Gilles Chanteperdrix @ 2010-04-21 9:58 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Apr 21, 2010 at 02:39:39AM -0700, skannan at codeaurora.org wrote:
>> Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling
>> and x86 udelay uses per-CPU data. So your concern should apply for x86
>> too. I had the same concern and was planning on bring it up in the cpufreq
>> mailing list after I made sure I didn't misunderstand anything.
>
> Well, x86 looks buggy in this regard as well - the loops_per_jiffy
> value used is for a CPU which may not run the delay loop.
It looks to me like x86 with a tsc use the tsc for udelay, and handle
cpu changes correctly. See the function delay_tsc in
arch/x86/lib/delay.c. However, they do not seem to handle frequency
changes that well.
--
Gilles.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 9:50 ` Russell King - ARM Linux
2010-04-21 9:58 ` Gilles Chanteperdrix
@ 2010-04-21 10:00 ` Jamie Lokier
2010-04-21 19:29 ` Russell King - ARM Linux
2010-04-21 10:31 ` skannan at codeaurora.org
2010-04-23 9:00 ` Pavel Machek
3 siblings, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2010-04-21 10:00 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> Well, the assumption is that the CPUs will be running at their fastest
> speed at boot time, and therefore loops_per_jiffy will be calibrated
> such that we guarantee _at least_ the asked-for delay - which is the
> only guarantee udelay has.
That's an interesting and not altogether reliable assumption.
On a device I'm working with, we just read a fixed-speed clock
register in a loop. It's slower than the CPU register loop, but given
udelay counts in great big slow _microsecond_ delays (how quaint! ;-)
that's fine.
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 9:50 ` Russell King - ARM Linux
2010-04-21 9:58 ` Gilles Chanteperdrix
2010-04-21 10:00 ` Jamie Lokier
@ 2010-04-21 10:31 ` skannan at codeaurora.org
2010-04-21 19:33 ` Russell King - ARM Linux
2010-04-23 9:00 ` Pavel Machek
3 siblings, 1 reply; 19+ messages in thread
From: skannan at codeaurora.org @ 2010-04-21 10:31 UTC (permalink / raw)
To: linux-arm-kernel
> Well, the assumption is that the CPUs will be running at their fastest
> speed at boot time, and therefore loops_per_jiffy will be calibrated
> such that we guarantee _at least_ the asked-for delay - which is the
> only guarantee udelay has.
Even if the boot assumption is true, cpufreq actively changes the
loops_per_jiffy value when it changes freq. So, this could still mess up
the _at least_ guarantee.
I think it's right for cpufreq to scale lpj with cpu speed to avoid udelay
from taking longer by several magnitudes. But we also need to avoid
breaking the guarantee. May be we can postpose the freq change if a
udelay loop is active.
Thanks,
Saravana
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 10:00 ` Jamie Lokier
@ 2010-04-21 19:29 ` Russell King - ARM Linux
2010-04-21 19:52 ` Jamie Lokier
0 siblings, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 19:29 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 21, 2010 at 11:00:08AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > Well, the assumption is that the CPUs will be running at their fastest
> > speed at boot time, and therefore loops_per_jiffy will be calibrated
> > such that we guarantee _at least_ the asked-for delay - which is the
> > only guarantee udelay has.
>
> That's an interesting and not altogether reliable assumption.
That depends which bit you're talking about. udelay() must give you the
delay you asked for, or a longer delay. If it gives you a shorter delay,
it's buggy plain and simple.
> On a device I'm working with, we just read a fixed-speed clock
> register in a loop. It's slower than the CPU register loop, but given
> udelay counts in great big slow _microsecond_ delays (how quaint! ;-)
> that's fine.
We could go to ns delays, but then we have a big problem - the cost of
calculating the number of loops starts to become significant compared to
the delays - and that's a quality of implementation factor. In fact,
the existing cost has always been significant for short delays for
slower (sub-100MHz) ARMs.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 10:31 ` skannan at codeaurora.org
@ 2010-04-21 19:33 ` Russell King - ARM Linux
2010-04-21 23:47 ` Saravana Kannan
0 siblings, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 19:33 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 21, 2010 at 03:31:03AM -0700, skannan at codeaurora.org wrote:
> > Well, the assumption is that the CPUs will be running at their fastest
> > speed at boot time, and therefore loops_per_jiffy will be calibrated
> > such that we guarantee _at least_ the asked-for delay - which is the
> > only guarantee udelay has.
>
> Even if the boot assumption is true, cpufreq actively changes the
> loops_per_jiffy value when it changes freq. So, this could still mess up
> the _at least_ guarantee.
Actually, it doesn't on SMP - if you build the kernel with SMP enabled,
the code which fiddles with loops_per_jiffy is disabled. See the
#ifndef wrapping around adjust_jiffies() in drivers/cpufreq/cpufreq.c.
So, on SMP with cpufreq, the global loops_per_jiffy is a fixed value.
Provided it was calibrated with the CPU running at max clock rate,
the guarantee is satisfied for all CPUs in the system.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 19:29 ` Russell King - ARM Linux
@ 2010-04-21 19:52 ` Jamie Lokier
2010-04-21 20:21 ` Russell King - ARM Linux
0 siblings, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2010-04-21 19:52 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Apr 21, 2010 at 11:00:08AM +0100, Jamie Lokier wrote:
> > Russell King - ARM Linux wrote:
> > > Well, the assumption is that the CPUs will be running at their fastest
> > > speed at boot time, and therefore loops_per_jiffy will be calibrated
> > > such that we guarantee _at least_ the asked-for delay - which is the
> > > only guarantee udelay has.
> >
> > That's an interesting and not altogether reliable assumption.
>
> That depends which bit you're talking about. udelay() must give you the
> delay you asked for, or a longer delay. If it gives you a shorter delay,
> it's buggy plain and simple.
>
> > On a device I'm working with, we just read a fixed-speed clock
> > register in a loop. It's slower than the CPU register loop, but given
> > udelay counts in great big slow _microsecond_ delays (how quaint! ;-)
> > that's fine.
>
> We could go to ns delays, but then we have a big problem - the cost of
> calculating the number of loops starts to become significant compared to
> the delays - and that's a quality of implementation factor. In fact,
> the existing cost has always been significant for short delays for
> slower (sub-100MHz) ARMs.
I'm surprised it makes much difference to, say, 20MHz ARMs because you
could structure it as a nested loop, the inner one executed once per
microsecond and calibrated to 1us. The delays don't have to be super
accurate.
With a fixed-speed clock register known at compile time, the
calculation tends to constant-fold nicely, even for ns delays. Not
suitable for multi-target kernels but good on single-target.
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 19:52 ` Jamie Lokier
@ 2010-04-21 20:21 ` Russell King - ARM Linux
2010-04-21 20:47 ` Jamie Lokier
0 siblings, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 20:21 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 21, 2010 at 08:52:25PM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > We could go to ns delays, but then we have a big problem - the cost of
> > calculating the number of loops starts to become significant compared to
> > the delays - and that's a quality of implementation factor. In fact,
> > the existing cost has always been significant for short delays for
> > slower (sub-100MHz) ARMs.
>
> I'm surprised it makes much difference to, say, 20MHz ARMs because you
> could structure it as a nested loop, the inner one executed once per
> microsecond and calibrated to 1us. The delays don't have to be super
> accurate.
You don't understand the issue. On older ARMs, the single 32-bit
multiply is not cheap; it shows up as having a significant time
expense for very short delays - and that _does_ matter.
Consider system performance where you're driving a bus using udelay()
to provide 1us timings, but udelay ends up taking 10us instead every
time because of the calculation for number of loops for a 1us timing.
> With a fixed-speed clock register known at compile time, the
> calculation tends to constant-fold nicely, even for ns delays. Not
> suitable for multi-target kernels but good on single-target.
Here you're making a very big assumption - that there's some register
you can read which is regularly clocked. That's not true on a lot of
older ARMs, where we struggle to satisfy sched_clock() due to lack of
such a register.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 20:21 ` Russell King - ARM Linux
@ 2010-04-21 20:47 ` Jamie Lokier
2010-04-21 20:57 ` Russell King - ARM Linux
0 siblings, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2010-04-21 20:47 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Apr 21, 2010 at 08:52:25PM +0100, Jamie Lokier wrote:
> > Russell King - ARM Linux wrote:
> > > We could go to ns delays, but then we have a big problem - the cost of
> > > calculating the number of loops starts to become significant compared to
> > > the delays - and that's a quality of implementation factor. In fact,
> > > the existing cost has always been significant for short delays for
> > > slower (sub-100MHz) ARMs.
> >
> > I'm surprised it makes much difference to, say, 20MHz ARMs because you
> > could structure it as a nested loop, the inner one executed once per
> > microsecond and calibrated to 1us. The delays don't have to be super
> > accurate.
>
> You don't understand the issue. On older ARMs, the single 32-bit
> multiply is not cheap; it shows up as having a significant time
> expense for very short delays - and that _does_ matter.
>
> Consider system performance where you're driving a bus using udelay()
> to provide 1us timings, but udelay ends up taking 10us instead every
> time because of the calculation for number of loops for a 1us timing.
Hence nested loop. You don't multiply. No calculation.
> > With a fixed-speed clock register known at compile time, the
> > calculation tends to constant-fold nicely, even for ns delays. Not
> > suitable for multi-target kernels but good on single-target.
>
> Here you're making a very big assumption - that there's some register
> you can read which is regularly clocked. That's not true on a lot of
> older ARMs, where we struggle to satisfy sched_clock() due to lack of
> such a register.
Yes, I know. I'm lucky to have one :-)
Where there is one, it seems like a good idea to use it if it's fast to read.
-- Jamie
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 20:47 ` Jamie Lokier
@ 2010-04-21 20:57 ` Russell King - ARM Linux
2010-04-22 0:14 ` Jamie Lokier
0 siblings, 1 reply; 19+ messages in thread
From: Russell King - ARM Linux @ 2010-04-21 20:57 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 21, 2010 at 09:47:18PM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > You don't understand the issue. On older ARMs, the single 32-bit
> > multiply is not cheap; it shows up as having a significant time
> > expense for very short delays - and that _does_ matter.
> >
> > Consider system performance where you're driving a bus using udelay()
> > to provide 1us timings, but udelay ends up taking 10us instead every
> > time because of the calculation for number of loops for a 1us timing.
>
> Hence nested loop. You don't multiply. No calculation.
Ok, since you seem to have a clear idea how to convert this into a double
nested loop, try converting it:
@ 0 <= r0 <= 0x7fffff06
ldr r2, .LC0 (loops_per_jiffy)
ldr r2, [r2] @ max = 0x01ffffff
mov r0, r0, lsr #14 @ max = 0x0001ffff
mov r2, r2, lsr #10 @ max = 0x00007fff
mul r0, r2, r0 @ max = 2^32-1
movs r0, r0, lsr #6
moveq pc, lr
1: subs r0, r0, #1
bhi 1b
mov pc, lr
into two loops without losing the precision - note that the multiply
is part of a 'dividing by multiply+shift' technique.
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 19:33 ` Russell King - ARM Linux
@ 2010-04-21 23:47 ` Saravana Kannan
0 siblings, 0 replies; 19+ messages in thread
From: Saravana Kannan @ 2010-04-21 23:47 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Apr 21, 2010 at 03:31:03AM -0700, skannan at codeaurora.org wrote:
>>> Well, the assumption is that the CPUs will be running at their fastest
>>> speed at boot time, and therefore loops_per_jiffy will be calibrated
>>> such that we guarantee _at least_ the asked-for delay - which is the
>>> only guarantee udelay has.
>> Even if the boot assumption is true, cpufreq actively changes the
>> loops_per_jiffy value when it changes freq. So, this could still mess up
>> the _at least_ guarantee.
>
> Actually, it doesn't on SMP - if you build the kernel with SMP enabled,
> the code which fiddles with loops_per_jiffy is disabled. See the
> #ifndef wrapping around adjust_jiffies() in drivers/cpufreq/cpufreq.c.
My comment above was for the non-SMP case (it was a reply to your
comment about non-SMP case). In non-SMP case, cpufreq changes LPJ and
the freq switch can happen while udelay is looping. That would mess up
the minimum delay guarantee of udelay.
I was aware that cpufreq doesn't change LPJ for SMP. But I think they do
that because they don't know where the arch specific per-CPU
loops_per_jiffy is located. They expect the cpufreq driver to do the lpj
scaling. So, per-CPU lpj is still going to change. At least, that's what
I took out of the following comment:
/*
* This function alters the system "loops_per_jiffy" for the clock
* speed change. Note that loops_per_jiffy cannot be updated on SMP
* systems as each CPU might be scaled differently. So, use the arch
* per-CPU loops_per_jiffy value wherever possible.
*/
> So, on SMP with cpufreq, the global loops_per_jiffy is a fixed value.
> Provided it was calibrated with the CPU running at max clock rate,
> the guarantee is satisfied for all CPUs in the system.
As mentioned earlier, I think the cpufreq driver for that specific arch
is supposed to handle the LPJ changes. But let's assume that's not true.
So, wouldn't this still be a problem? You could be doing udelay as if
you are running at 1 GHz but you are actually running at 100 MHz. I
would think that would be bad for performance and power (wasting cycles
without going into WFI, etc).
Thanks,
Saravana
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 20:57 ` Russell King - ARM Linux
@ 2010-04-22 0:14 ` Jamie Lokier
2011-01-08 23:24 ` Russell King - ARM Linux
0 siblings, 1 reply; 19+ messages in thread
From: Jamie Lokier @ 2010-04-22 0:14 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> > > Consider system performance where you're driving a bus using udelay()
> > > to provide 1us timings, but udelay ends up taking 10us instead every
> > > time because of the calculation for number of loops for a 1us timing.
> >
> > Hence nested loop. You don't multiply. No calculation.
>
> Ok, since you seem to have a clear idea how to convert this into a double
> nested loop, try converting it:
>
> @ 0 <= r0 <= 0x7fffff06
> ldr r2, .LC0 (loops_per_jiffy)
> ldr r2, [r2] @ max = 0x01ffffff
> mov r0, r0, lsr #14 @ max = 0x0001ffff
> mov r2, r2, lsr #10 @ max = 0x00007fff
> mul r0, r2, r0 @ max = 2^32-1
> movs r0, r0, lsr #6
> moveq pc, lr
> 1: subs r0, r0, #1
> bhi 1b
> mov pc, lr
>
> into two loops without losing the precision - note that the multiply
> is part of a 'dividing by multiply+shift' technique.
ldr r2, loops_per_jiffy
ldr r3, microseconds_per_jiffy
mov r4, r2
1: subs r4, r4, r3
bhi 1b
subs r0, r0, #1
add r4, r4, r2
bhi 1b
mov pc, lr
Goodnight :)
- Jamie
(Admission: I wasn't thinking of high precision when I glibly said two
loops; your challenge prompted me to work it out, and I was pleasantly
surprised to see it come out so neatly.)
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-21 9:50 ` Russell King - ARM Linux
` (2 preceding siblings ...)
2010-04-21 10:31 ` skannan at codeaurora.org
@ 2010-04-23 9:00 ` Pavel Machek
3 siblings, 0 replies; 19+ messages in thread
From: Pavel Machek @ 2010-04-23 9:00 UTC (permalink / raw)
To: linux-arm-kernel
Hi!
> > Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling
> > and x86 udelay uses per-CPU data. So your concern should apply for x86
> > too. I had the same concern and was planning on bring it up in the cpufreq
> > mailing list after I made sure I didn't misunderstand anything.
>
> Well, x86 looks buggy in this regard as well - the loops_per_jiffy
> value used is for a CPU which may not run the delay loop.
x86 assumes all cores to be essentially same.
> > Btw, your concern should apply for single core scaling too, right? Context
> > switch can complete within max udelay (general - 5ms, ARM - 2ms) time and
> > CPU could have jumped
> > from lowest to highest speed in that time and mess up udelay. I didn't see
> > any code in cpufreq that deferred scaling during udelay. So, that's something
> > I plan to ask cpufreq folks too.
>
> Well, the assumption is that the CPUs will be running at their fastest
> speed at boot time, and therefore loops_per_jiffy will be calibrated
> such that we guarantee _at least_ the asked-for delay - which is the
> only guarantee udelay has.
Well, some machines can't reach max cpu speed on battery power, so
there may be a problem there.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores?
2010-04-22 0:14 ` Jamie Lokier
@ 2011-01-08 23:24 ` Russell King - ARM Linux
0 siblings, 0 replies; 19+ messages in thread
From: Russell King - ARM Linux @ 2011-01-08 23:24 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Apr 22, 2010 at 01:14:17AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > Ok, since you seem to have a clear idea how to convert this into a double
> > nested loop, try converting it:
> >
> > @ 0 <= r0 <= 0x7fffff06
> > ldr r2, .LC0 (loops_per_jiffy)
> > ldr r2, [r2] @ max = 0x01ffffff
> > mov r0, r0, lsr #14 @ max = 0x0001ffff
> > mov r2, r2, lsr #10 @ max = 0x00007fff
> > mul r0, r2, r0 @ max = 2^32-1
> > movs r0, r0, lsr #6
> > moveq pc, lr
> > 1: subs r0, r0, #1
> > bhi 1b
> > mov pc, lr
> >
> > into two loops without losing the precision - note that the multiply
> > is part of a 'dividing by multiply+shift' technique.
>
> ldr r2, loops_per_jiffy
> ldr r3, microseconds_per_jiffy
> mov r4, r2
> 1: subs r4, r4, r3
> bhi 1b
> subs r0, r0, #1
> add r4, r4, r2
> bhi 1b
> mov pc, lr
>
> Goodnight :)
I thought I'd dig this out and give it a go - but it has problems. Let's
say usec_per_jiffy is 10000. Initially, loops_per_jiffy is 1<<12 or 4096
at boot.
If udelay() is used prior to calibration (it is - see things like OMAP/8250
console drivers which use udelay(1)), the initial loops_per_jiffy value
will be used.
So, r0 = 10000. r3 = 10000. r2 = 4096.
mov r4, r2 @ r4 := 4096
1: subs r4, r4, r3 @ r4 -= 10000 := -5904
bhi 1b @ not taken
subs r0, r0, #1 @ r0 -= 1 := 9999
add r4, r4, r2 @ r4 += 4096 := -1808 (or 4294965488)
bhi 1b @ taken
That's the first iteration. The next iteration:
1: subs r4, r4, r3 @ r4 -= 10000 := 4294955488
bhi 1b @ taken
1: subs r4, r4, r3 @ r4 -= 10000 := 4294945488
bhi 1b @ taken
... which means we have about 429493 loops to go ...
So this becomes an extremely slow loop - it works when loops_per_jiffy >
usec_per_jiffy.
Even with a value of 8192 (the first tried lpj in the calibration loop),
things eventually go wrong - r4 on each iteration goes -1808, -3616, ..
-9040 and then we're into the problem above - and this will be the case
for anyone with HZ=100.
So, this solution has undesirable behaviours... and this is what I've
come up with - we manually increase the lpj and decrease the required
delay by a power of two until we meet the necessary preconditions
(lpj >= usec/jiffy).
.LC0: .long loops_per_jiffy
.long (1000000 + (HZ / 2))/HZ
ENTRY(__delay)
ldr r3, .LC0 + 4 @ usec/jiffy
mov r2, r0
mov r0, r3
b 2f
ENTRY(__udelay)
ldr r2, .LC0
ldr r3, .LC0 + 4 @ usec/jiffy
ldr r2, [r2] @ lpj
2: cmp r2, r3
movcc r2, r2, lsl #1
movcc r0, r0, lsr #1
bcc 2b
mov ip, r2
1: subs ip, ip, r3
addls ip, ip, r2
sublss r0, r0, #2
bhi 1b
mov pc, lr
Note that I've also tweaked the loop a little to make the cycle count
(in theory) around the loop the same no matter what it does.
This way, I get the same lpj calibration value as the old way - which is
good as with the old way, we were calibrating just this loop:
1: subs r0, r0, #1
bhi 1b
mov pc, lr
where r0 = lpj and the target delay time was 1 jiffy.
Now, what sparked this off was:
> > We could go to ns delays, but then we have a big problem - the cost of
> > calculating the number of loops starts to become significant compared to
> > the delays - and that's a quality of implementation factor. In fact,
> > the existing cost has always been significant for short delays for
> > slower (sub-100MHz) ARMs.
>
> I'm surprised it makes much difference to, say, 20MHz ARMs because you
> could structure it as a nested loop, the inner one executed once per
> microsecond and calibrated to 1us. The delays don't have to be super
> accurate.
With this nested loop approach we can't go to ns resolution.
nsec_per_jiffy would be 10000000, and with an initial loops_per_jiffy
of 4096 or 8192, this would be extremely bad.
That said, I do think your approach has merit - especially as we're
now seeing CPUs in the 2000 BogoMips range, and our existing solution
goes bad at 3355 BogoMips. As the board I have is something like 8
months old we've probably got what, 10 months left according to Moore's
law?
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2011-01-08 23:24 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-21 2:19 udelay() broken for SMP cores? Saravana Kannan
2010-04-21 4:56 ` Shilimkar, Santosh
2010-04-21 6:43 ` Saravana Kannan
2010-04-21 7:22 ` Russell King - ARM Linux
2010-04-21 9:39 ` skannan at codeaurora.org
2010-04-21 9:50 ` Russell King - ARM Linux
2010-04-21 9:58 ` Gilles Chanteperdrix
2010-04-21 10:00 ` Jamie Lokier
2010-04-21 19:29 ` Russell King - ARM Linux
2010-04-21 19:52 ` Jamie Lokier
2010-04-21 20:21 ` Russell King - ARM Linux
2010-04-21 20:47 ` Jamie Lokier
2010-04-21 20:57 ` Russell King - ARM Linux
2010-04-22 0:14 ` Jamie Lokier
2011-01-08 23:24 ` Russell King - ARM Linux
2010-04-21 10:31 ` skannan at codeaurora.org
2010-04-21 19:33 ` Russell King - ARM Linux
2010-04-21 23:47 ` Saravana Kannan
2010-04-23 9:00 ` Pavel Machek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).