* udelay() broken for SMP cores? @ 2010-04-21 2:19 Saravana Kannan 2010-04-21 4:56 ` Shilimkar, Santosh 0 siblings, 1 reply; 19+ messages in thread From: Saravana Kannan @ 2010-04-21 2:19 UTC (permalink / raw) To: linux-arm-kernel Fixed the subject and starting a new thread. Please ignore the previous email with an incorrect subject. Hi, This is the first time I'm mailing LAKML, so please excuse me if I do something silly (looks like I already did!). I looked at arch/arm/lib/delay.S and it looks like __udelay and __const_udelay won't work correctly for SMP cores. The code just uses the loops_per_jiffy variable instead of the per-CPU loops per jiffy data. Is anyone already working on a fix for that? If not, I can fix it up in a way that's hopefully palatable to the community. Thanks, Saravana ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 2:19 udelay() broken for SMP cores? Saravana Kannan @ 2010-04-21 4:56 ` Shilimkar, Santosh 2010-04-21 6:43 ` Saravana Kannan 0 siblings, 1 reply; 19+ messages in thread From: Shilimkar, Santosh @ 2010-04-21 4:56 UTC (permalink / raw) To: linux-arm-kernel > -----Original Message----- > From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel- > bounces at lists.infradead.org] On Behalf Of Saravana Kannan > Sent: Wednesday, April 21, 2010 7:50 AM > To: linux-arm-kernel at lists.infradead.org > Subject: udelay() broken for SMP cores? > > Fixed the subject and starting a new thread. Please ignore the previous > email with an incorrect subject. > > Hi, > > This is the first time I'm mailing LAKML, so please excuse me if I do > something silly (looks like I already did!). > > I looked at arch/arm/lib/delay.S and it looks like __udelay and > __const_udelay won't work correctly for SMP cores. The code just uses > the loops_per_jiffy variable instead of the per-CPU loops per jiffy data. > > Is anyone already working on a fix for that? If not, I can fix it up in > a way that's hopefully palatable to the community. Two questions : 1. Why do you think the udelay is broken ?? Did you measure the time ?? 2. Have you arrived at this conclusion based on BOGO MIPS value ?? If you are correlating BIGOMIPS with udelay, then I am sure you are going Get it wrong because of the aggressive nature of newer ARM's where the branch loop in the udelay is always hit and instead of fetching it from cache/memory, its taken from pipe line itself. Ofcourse you point of using the per cpu "loops_per_jiffy" variable may be valid, but ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 4:56 ` Shilimkar, Santosh @ 2010-04-21 6:43 ` Saravana Kannan 2010-04-21 7:22 ` Russell King - ARM Linux 0 siblings, 1 reply; 19+ messages in thread From: Saravana Kannan @ 2010-04-21 6:43 UTC (permalink / raw) To: linux-arm-kernel >> I looked at arch/arm/lib/delay.S and it looks like __udelay and >> __const_udelay won't work correctly for SMP cores. The code just uses >> the loops_per_jiffy variable instead of the per-CPU loops per jiffy data. >> >> Is anyone already working on a fix for that? If not, I can fix it up in >> a way that's hopefully palatable to the community. > > If you have case where individual CPUs are running at different speed and different > Tick rate then only this can make difference. Yes. I was talking about the case where the CPUs could be running at different speeds. Thanks, Saravana ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 6:43 ` Saravana Kannan @ 2010-04-21 7:22 ` Russell King - ARM Linux 2010-04-21 9:39 ` skannan at codeaurora.org 0 siblings, 1 reply; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 7:22 UTC (permalink / raw) To: linux-arm-kernel On Tue, Apr 20, 2010 at 11:43:23PM -0700, Saravana Kannan wrote: > >>> I looked at arch/arm/lib/delay.S and it looks like __udelay and >>> __const_udelay won't work correctly for SMP cores. The code just uses >>> the loops_per_jiffy variable instead of the per-CPU loops per jiffy data. >>> >>> Is anyone already working on a fix for that? If not, I can fix it up in >>> a way that's hopefully palatable to the community. >> > >> If you have case where individual CPUs are running at different speed and different >> Tick rate then only this can make difference. > > Yes. I was talking about the case where the CPUs could be running at > different speeds. We don't support that; if we did, we'd have to disable preempt for every call to mdelay/udelay to ensure that the thread is locked to a particular CPU. I suspect that will (a) destroy RT scheduling preformance (b) increase preempt latency to an undesirable extent. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 7:22 ` Russell King - ARM Linux @ 2010-04-21 9:39 ` skannan at codeaurora.org 2010-04-21 9:50 ` Russell King - ARM Linux 0 siblings, 1 reply; 19+ messages in thread From: skannan at codeaurora.org @ 2010-04-21 9:39 UTC (permalink / raw) To: linux-arm-kernel > On Tue, Apr 20, 2010 at 11:43:23PM -0700, Saravana Kannan wrote: >> >>>> I looked at arch/arm/lib/delay.S and it looks like __udelay and >>>> __const_udelay won't work correctly for SMP cores. The code just uses >>>> the loops_per_jiffy variable instead of the per-CPU loops per jiffy >>>> data. >>>> >>>> Is anyone already working on a fix for that? If not, I can fix it up >>>> in >>>> a way that's hopefully palatable to the community. >>> >> >>> If you have case where individual CPUs are running at different speed >>> and different >>> Tick rate then only this can make difference. >> >> Yes. I was talking about the case where the CPUs could be running at >> different speeds. > > We don't support that; if we did, we'd have to disable preempt for every > call to mdelay/udelay to ensure that the thread is locked to a particular > CPU. I suspect that will (a) destroy RT scheduling preformance (b) > increase preempt latency to an undesirable extent. > Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling and x86 udelay uses per-CPU data. So your concern should apply for x86 too. I had the same concern and was planning on bring it up in the cpufreq mailing list after I made sure I didn't misunderstand anything. Btw, your concern should apply for single core scaling too, right? Context switch can complete within max udelay (general - 5ms, ARM - 2ms) time and CPU could have jumped from lowest to highest speed in that time and mess up udelay. I didn't see any code in cpufreq that deferred scaling during udelay. So, that's something I plan to ask cpufreq folks too. Let me know what you think. Thanks, Saravana P.S: Sent from phone. Pls excuse formatting issues. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 9:39 ` skannan at codeaurora.org @ 2010-04-21 9:50 ` Russell King - ARM Linux 2010-04-21 9:58 ` Gilles Chanteperdrix ` (3 more replies) 0 siblings, 4 replies; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 9:50 UTC (permalink / raw) To: linux-arm-kernel On Wed, Apr 21, 2010 at 02:39:39AM -0700, skannan at codeaurora.org wrote: > Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling > and x86 udelay uses per-CPU data. So your concern should apply for x86 > too. I had the same concern and was planning on bring it up in the cpufreq > mailing list after I made sure I didn't misunderstand anything. Well, x86 looks buggy in this regard as well - the loops_per_jiffy value used is for a CPU which may not run the delay loop. > Btw, your concern should apply for single core scaling too, right? Context > switch can complete within max udelay (general - 5ms, ARM - 2ms) time and > CPU could have jumped > from lowest to highest speed in that time and mess up udelay. I didn't see > any code in cpufreq that deferred scaling during udelay. So, that's something > I plan to ask cpufreq folks too. Well, the assumption is that the CPUs will be running at their fastest speed at boot time, and therefore loops_per_jiffy will be calibrated such that we guarantee _at least_ the asked-for delay - which is the only guarantee udelay has. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 9:50 ` Russell King - ARM Linux @ 2010-04-21 9:58 ` Gilles Chanteperdrix 2010-04-21 10:00 ` Jamie Lokier ` (2 subsequent siblings) 3 siblings, 0 replies; 19+ messages in thread From: Gilles Chanteperdrix @ 2010-04-21 9:58 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Apr 21, 2010 at 02:39:39AM -0700, skannan at codeaurora.org wrote: >> Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling >> and x86 udelay uses per-CPU data. So your concern should apply for x86 >> too. I had the same concern and was planning on bring it up in the cpufreq >> mailing list after I made sure I didn't misunderstand anything. > > Well, x86 looks buggy in this regard as well - the loops_per_jiffy > value used is for a CPU which may not run the delay loop. It looks to me like x86 with a tsc use the tsc for udelay, and handle cpu changes correctly. See the function delay_tsc in arch/x86/lib/delay.c. However, they do not seem to handle frequency changes that well. -- Gilles. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 9:50 ` Russell King - ARM Linux 2010-04-21 9:58 ` Gilles Chanteperdrix @ 2010-04-21 10:00 ` Jamie Lokier 2010-04-21 19:29 ` Russell King - ARM Linux 2010-04-21 10:31 ` skannan at codeaurora.org 2010-04-23 9:00 ` Pavel Machek 3 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2010-04-21 10:00 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > Well, the assumption is that the CPUs will be running at their fastest > speed at boot time, and therefore loops_per_jiffy will be calibrated > such that we guarantee _at least_ the asked-for delay - which is the > only guarantee udelay has. That's an interesting and not altogether reliable assumption. On a device I'm working with, we just read a fixed-speed clock register in a loop. It's slower than the CPU register loop, but given udelay counts in great big slow _microsecond_ delays (how quaint! ;-) that's fine. -- Jamie ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 10:00 ` Jamie Lokier @ 2010-04-21 19:29 ` Russell King - ARM Linux 2010-04-21 19:52 ` Jamie Lokier 0 siblings, 1 reply; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 19:29 UTC (permalink / raw) To: linux-arm-kernel On Wed, Apr 21, 2010 at 11:00:08AM +0100, Jamie Lokier wrote: > Russell King - ARM Linux wrote: > > Well, the assumption is that the CPUs will be running at their fastest > > speed at boot time, and therefore loops_per_jiffy will be calibrated > > such that we guarantee _at least_ the asked-for delay - which is the > > only guarantee udelay has. > > That's an interesting and not altogether reliable assumption. That depends which bit you're talking about. udelay() must give you the delay you asked for, or a longer delay. If it gives you a shorter delay, it's buggy plain and simple. > On a device I'm working with, we just read a fixed-speed clock > register in a loop. It's slower than the CPU register loop, but given > udelay counts in great big slow _microsecond_ delays (how quaint! ;-) > that's fine. We could go to ns delays, but then we have a big problem - the cost of calculating the number of loops starts to become significant compared to the delays - and that's a quality of implementation factor. In fact, the existing cost has always been significant for short delays for slower (sub-100MHz) ARMs. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 19:29 ` Russell King - ARM Linux @ 2010-04-21 19:52 ` Jamie Lokier 2010-04-21 20:21 ` Russell King - ARM Linux 0 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2010-04-21 19:52 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Apr 21, 2010 at 11:00:08AM +0100, Jamie Lokier wrote: > > Russell King - ARM Linux wrote: > > > Well, the assumption is that the CPUs will be running at their fastest > > > speed at boot time, and therefore loops_per_jiffy will be calibrated > > > such that we guarantee _at least_ the asked-for delay - which is the > > > only guarantee udelay has. > > > > That's an interesting and not altogether reliable assumption. > > That depends which bit you're talking about. udelay() must give you the > delay you asked for, or a longer delay. If it gives you a shorter delay, > it's buggy plain and simple. > > > On a device I'm working with, we just read a fixed-speed clock > > register in a loop. It's slower than the CPU register loop, but given > > udelay counts in great big slow _microsecond_ delays (how quaint! ;-) > > that's fine. > > We could go to ns delays, but then we have a big problem - the cost of > calculating the number of loops starts to become significant compared to > the delays - and that's a quality of implementation factor. In fact, > the existing cost has always been significant for short delays for > slower (sub-100MHz) ARMs. I'm surprised it makes much difference to, say, 20MHz ARMs because you could structure it as a nested loop, the inner one executed once per microsecond and calibrated to 1us. The delays don't have to be super accurate. With a fixed-speed clock register known at compile time, the calculation tends to constant-fold nicely, even for ns delays. Not suitable for multi-target kernels but good on single-target. -- Jamie ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 19:52 ` Jamie Lokier @ 2010-04-21 20:21 ` Russell King - ARM Linux 2010-04-21 20:47 ` Jamie Lokier 0 siblings, 1 reply; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 20:21 UTC (permalink / raw) To: linux-arm-kernel On Wed, Apr 21, 2010 at 08:52:25PM +0100, Jamie Lokier wrote: > Russell King - ARM Linux wrote: > > We could go to ns delays, but then we have a big problem - the cost of > > calculating the number of loops starts to become significant compared to > > the delays - and that's a quality of implementation factor. In fact, > > the existing cost has always been significant for short delays for > > slower (sub-100MHz) ARMs. > > I'm surprised it makes much difference to, say, 20MHz ARMs because you > could structure it as a nested loop, the inner one executed once per > microsecond and calibrated to 1us. The delays don't have to be super > accurate. You don't understand the issue. On older ARMs, the single 32-bit multiply is not cheap; it shows up as having a significant time expense for very short delays - and that _does_ matter. Consider system performance where you're driving a bus using udelay() to provide 1us timings, but udelay ends up taking 10us instead every time because of the calculation for number of loops for a 1us timing. > With a fixed-speed clock register known at compile time, the > calculation tends to constant-fold nicely, even for ns delays. Not > suitable for multi-target kernels but good on single-target. Here you're making a very big assumption - that there's some register you can read which is regularly clocked. That's not true on a lot of older ARMs, where we struggle to satisfy sched_clock() due to lack of such a register. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 20:21 ` Russell King - ARM Linux @ 2010-04-21 20:47 ` Jamie Lokier 2010-04-21 20:57 ` Russell King - ARM Linux 0 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2010-04-21 20:47 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Apr 21, 2010 at 08:52:25PM +0100, Jamie Lokier wrote: > > Russell King - ARM Linux wrote: > > > We could go to ns delays, but then we have a big problem - the cost of > > > calculating the number of loops starts to become significant compared to > > > the delays - and that's a quality of implementation factor. In fact, > > > the existing cost has always been significant for short delays for > > > slower (sub-100MHz) ARMs. > > > > I'm surprised it makes much difference to, say, 20MHz ARMs because you > > could structure it as a nested loop, the inner one executed once per > > microsecond and calibrated to 1us. The delays don't have to be super > > accurate. > > You don't understand the issue. On older ARMs, the single 32-bit > multiply is not cheap; it shows up as having a significant time > expense for very short delays - and that _does_ matter. > > Consider system performance where you're driving a bus using udelay() > to provide 1us timings, but udelay ends up taking 10us instead every > time because of the calculation for number of loops for a 1us timing. Hence nested loop. You don't multiply. No calculation. > > With a fixed-speed clock register known at compile time, the > > calculation tends to constant-fold nicely, even for ns delays. Not > > suitable for multi-target kernels but good on single-target. > > Here you're making a very big assumption - that there's some register > you can read which is regularly clocked. That's not true on a lot of > older ARMs, where we struggle to satisfy sched_clock() due to lack of > such a register. Yes, I know. I'm lucky to have one :-) Where there is one, it seems like a good idea to use it if it's fast to read. -- Jamie ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 20:47 ` Jamie Lokier @ 2010-04-21 20:57 ` Russell King - ARM Linux 2010-04-22 0:14 ` Jamie Lokier 0 siblings, 1 reply; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 20:57 UTC (permalink / raw) To: linux-arm-kernel On Wed, Apr 21, 2010 at 09:47:18PM +0100, Jamie Lokier wrote: > Russell King - ARM Linux wrote: > > You don't understand the issue. On older ARMs, the single 32-bit > > multiply is not cheap; it shows up as having a significant time > > expense for very short delays - and that _does_ matter. > > > > Consider system performance where you're driving a bus using udelay() > > to provide 1us timings, but udelay ends up taking 10us instead every > > time because of the calculation for number of loops for a 1us timing. > > Hence nested loop. You don't multiply. No calculation. Ok, since you seem to have a clear idea how to convert this into a double nested loop, try converting it: @ 0 <= r0 <= 0x7fffff06 ldr r2, .LC0 (loops_per_jiffy) ldr r2, [r2] @ max = 0x01ffffff mov r0, r0, lsr #14 @ max = 0x0001ffff mov r2, r2, lsr #10 @ max = 0x00007fff mul r0, r2, r0 @ max = 2^32-1 movs r0, r0, lsr #6 moveq pc, lr 1: subs r0, r0, #1 bhi 1b mov pc, lr into two loops without losing the precision - note that the multiply is part of a 'dividing by multiply+shift' technique. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 20:57 ` Russell King - ARM Linux @ 2010-04-22 0:14 ` Jamie Lokier 2011-01-08 23:24 ` Russell King - ARM Linux 0 siblings, 1 reply; 19+ messages in thread From: Jamie Lokier @ 2010-04-22 0:14 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > > > Consider system performance where you're driving a bus using udelay() > > > to provide 1us timings, but udelay ends up taking 10us instead every > > > time because of the calculation for number of loops for a 1us timing. > > > > Hence nested loop. You don't multiply. No calculation. > > Ok, since you seem to have a clear idea how to convert this into a double > nested loop, try converting it: > > @ 0 <= r0 <= 0x7fffff06 > ldr r2, .LC0 (loops_per_jiffy) > ldr r2, [r2] @ max = 0x01ffffff > mov r0, r0, lsr #14 @ max = 0x0001ffff > mov r2, r2, lsr #10 @ max = 0x00007fff > mul r0, r2, r0 @ max = 2^32-1 > movs r0, r0, lsr #6 > moveq pc, lr > 1: subs r0, r0, #1 > bhi 1b > mov pc, lr > > into two loops without losing the precision - note that the multiply > is part of a 'dividing by multiply+shift' technique. ldr r2, loops_per_jiffy ldr r3, microseconds_per_jiffy mov r4, r2 1: subs r4, r4, r3 bhi 1b subs r0, r0, #1 add r4, r4, r2 bhi 1b mov pc, lr Goodnight :) - Jamie (Admission: I wasn't thinking of high precision when I glibly said two loops; your challenge prompted me to work it out, and I was pleasantly surprised to see it come out so neatly.) ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-22 0:14 ` Jamie Lokier @ 2011-01-08 23:24 ` Russell King - ARM Linux 0 siblings, 0 replies; 19+ messages in thread From: Russell King - ARM Linux @ 2011-01-08 23:24 UTC (permalink / raw) To: linux-arm-kernel On Thu, Apr 22, 2010 at 01:14:17AM +0100, Jamie Lokier wrote: > Russell King - ARM Linux wrote: > > Ok, since you seem to have a clear idea how to convert this into a double > > nested loop, try converting it: > > > > @ 0 <= r0 <= 0x7fffff06 > > ldr r2, .LC0 (loops_per_jiffy) > > ldr r2, [r2] @ max = 0x01ffffff > > mov r0, r0, lsr #14 @ max = 0x0001ffff > > mov r2, r2, lsr #10 @ max = 0x00007fff > > mul r0, r2, r0 @ max = 2^32-1 > > movs r0, r0, lsr #6 > > moveq pc, lr > > 1: subs r0, r0, #1 > > bhi 1b > > mov pc, lr > > > > into two loops without losing the precision - note that the multiply > > is part of a 'dividing by multiply+shift' technique. > > ldr r2, loops_per_jiffy > ldr r3, microseconds_per_jiffy > mov r4, r2 > 1: subs r4, r4, r3 > bhi 1b > subs r0, r0, #1 > add r4, r4, r2 > bhi 1b > mov pc, lr > > Goodnight :) I thought I'd dig this out and give it a go - but it has problems. Let's say usec_per_jiffy is 10000. Initially, loops_per_jiffy is 1<<12 or 4096 at boot. If udelay() is used prior to calibration (it is - see things like OMAP/8250 console drivers which use udelay(1)), the initial loops_per_jiffy value will be used. So, r0 = 10000. r3 = 10000. r2 = 4096. mov r4, r2 @ r4 := 4096 1: subs r4, r4, r3 @ r4 -= 10000 := -5904 bhi 1b @ not taken subs r0, r0, #1 @ r0 -= 1 := 9999 add r4, r4, r2 @ r4 += 4096 := -1808 (or 4294965488) bhi 1b @ taken That's the first iteration. The next iteration: 1: subs r4, r4, r3 @ r4 -= 10000 := 4294955488 bhi 1b @ taken 1: subs r4, r4, r3 @ r4 -= 10000 := 4294945488 bhi 1b @ taken ... which means we have about 429493 loops to go ... So this becomes an extremely slow loop - it works when loops_per_jiffy > usec_per_jiffy. Even with a value of 8192 (the first tried lpj in the calibration loop), things eventually go wrong - r4 on each iteration goes -1808, -3616, .. -9040 and then we're into the problem above - and this will be the case for anyone with HZ=100. So, this solution has undesirable behaviours... and this is what I've come up with - we manually increase the lpj and decrease the required delay by a power of two until we meet the necessary preconditions (lpj >= usec/jiffy). .LC0: .long loops_per_jiffy .long (1000000 + (HZ / 2))/HZ ENTRY(__delay) ldr r3, .LC0 + 4 @ usec/jiffy mov r2, r0 mov r0, r3 b 2f ENTRY(__udelay) ldr r2, .LC0 ldr r3, .LC0 + 4 @ usec/jiffy ldr r2, [r2] @ lpj 2: cmp r2, r3 movcc r2, r2, lsl #1 movcc r0, r0, lsr #1 bcc 2b mov ip, r2 1: subs ip, ip, r3 addls ip, ip, r2 sublss r0, r0, #2 bhi 1b mov pc, lr Note that I've also tweaked the loop a little to make the cycle count (in theory) around the loop the same no matter what it does. This way, I get the same lpj calibration value as the old way - which is good as with the old way, we were calibrating just this loop: 1: subs r0, r0, #1 bhi 1b mov pc, lr where r0 = lpj and the target delay time was 1 jiffy. Now, what sparked this off was: > > We could go to ns delays, but then we have a big problem - the cost of > > calculating the number of loops starts to become significant compared to > > the delays - and that's a quality of implementation factor. In fact, > > the existing cost has always been significant for short delays for > > slower (sub-100MHz) ARMs. > > I'm surprised it makes much difference to, say, 20MHz ARMs because you > could structure it as a nested loop, the inner one executed once per > microsecond and calibrated to 1us. The delays don't have to be super > accurate. With this nested loop approach we can't go to ns resolution. nsec_per_jiffy would be 10000000, and with an initial loops_per_jiffy of 4096 or 8192, this would be extremely bad. That said, I do think your approach has merit - especially as we're now seeing CPUs in the 2000 BogoMips range, and our existing solution goes bad at 3355 BogoMips. As the board I have is something like 8 months old we've probably got what, 10 months left according to Moore's law? ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 9:50 ` Russell King - ARM Linux 2010-04-21 9:58 ` Gilles Chanteperdrix 2010-04-21 10:00 ` Jamie Lokier @ 2010-04-21 10:31 ` skannan at codeaurora.org 2010-04-21 19:33 ` Russell King - ARM Linux 2010-04-23 9:00 ` Pavel Machek 3 siblings, 1 reply; 19+ messages in thread From: skannan at codeaurora.org @ 2010-04-21 10:31 UTC (permalink / raw) To: linux-arm-kernel > Well, the assumption is that the CPUs will be running at their fastest > speed at boot time, and therefore loops_per_jiffy will be calibrated > such that we guarantee _at least_ the asked-for delay - which is the > only guarantee udelay has. Even if the boot assumption is true, cpufreq actively changes the loops_per_jiffy value when it changes freq. So, this could still mess up the _at least_ guarantee. I think it's right for cpufreq to scale lpj with cpu speed to avoid udelay from taking longer by several magnitudes. But we also need to avoid breaking the guarantee. May be we can postpose the freq change if a udelay loop is active. Thanks, Saravana ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 10:31 ` skannan at codeaurora.org @ 2010-04-21 19:33 ` Russell King - ARM Linux 2010-04-21 23:47 ` Saravana Kannan 0 siblings, 1 reply; 19+ messages in thread From: Russell King - ARM Linux @ 2010-04-21 19:33 UTC (permalink / raw) To: linux-arm-kernel On Wed, Apr 21, 2010 at 03:31:03AM -0700, skannan at codeaurora.org wrote: > > Well, the assumption is that the CPUs will be running at their fastest > > speed at boot time, and therefore loops_per_jiffy will be calibrated > > such that we guarantee _at least_ the asked-for delay - which is the > > only guarantee udelay has. > > Even if the boot assumption is true, cpufreq actively changes the > loops_per_jiffy value when it changes freq. So, this could still mess up > the _at least_ guarantee. Actually, it doesn't on SMP - if you build the kernel with SMP enabled, the code which fiddles with loops_per_jiffy is disabled. See the #ifndef wrapping around adjust_jiffies() in drivers/cpufreq/cpufreq.c. So, on SMP with cpufreq, the global loops_per_jiffy is a fixed value. Provided it was calibrated with the CPU running at max clock rate, the guarantee is satisfied for all CPUs in the system. ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 19:33 ` Russell King - ARM Linux @ 2010-04-21 23:47 ` Saravana Kannan 0 siblings, 0 replies; 19+ messages in thread From: Saravana Kannan @ 2010-04-21 23:47 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Apr 21, 2010 at 03:31:03AM -0700, skannan at codeaurora.org wrote: >>> Well, the assumption is that the CPUs will be running at their fastest >>> speed at boot time, and therefore loops_per_jiffy will be calibrated >>> such that we guarantee _at least_ the asked-for delay - which is the >>> only guarantee udelay has. >> Even if the boot assumption is true, cpufreq actively changes the >> loops_per_jiffy value when it changes freq. So, this could still mess up >> the _at least_ guarantee. > > Actually, it doesn't on SMP - if you build the kernel with SMP enabled, > the code which fiddles with loops_per_jiffy is disabled. See the > #ifndef wrapping around adjust_jiffies() in drivers/cpufreq/cpufreq.c. My comment above was for the non-SMP case (it was a reply to your comment about non-SMP case). In non-SMP case, cpufreq changes LPJ and the freq switch can happen while udelay is looping. That would mess up the minimum delay guarantee of udelay. I was aware that cpufreq doesn't change LPJ for SMP. But I think they do that because they don't know where the arch specific per-CPU loops_per_jiffy is located. They expect the cpufreq driver to do the lpj scaling. So, per-CPU lpj is still going to change. At least, that's what I took out of the following comment: /* * This function alters the system "loops_per_jiffy" for the clock * speed change. Note that loops_per_jiffy cannot be updated on SMP * systems as each CPU might be scaled differently. So, use the arch * per-CPU loops_per_jiffy value wherever possible. */ > So, on SMP with cpufreq, the global loops_per_jiffy is a fixed value. > Provided it was calibrated with the CPU running at max clock rate, > the guarantee is satisfied for all CPUs in the system. As mentioned earlier, I think the cpufreq driver for that specific arch is supposed to handle the LPJ changes. But let's assume that's not true. So, wouldn't this still be a problem? You could be doing udelay as if you are running at 1 GHz but you are actually running at 100 MHz. I would think that would be bad for performance and power (wasting cycles without going into WFI, etc). Thanks, Saravana ^ permalink raw reply [flat|nested] 19+ messages in thread
* udelay() broken for SMP cores? 2010-04-21 9:50 ` Russell King - ARM Linux ` (2 preceding siblings ...) 2010-04-21 10:31 ` skannan at codeaurora.org @ 2010-04-23 9:00 ` Pavel Machek 3 siblings, 0 replies; 19+ messages in thread From: Pavel Machek @ 2010-04-23 9:00 UTC (permalink / raw) To: linux-arm-kernel Hi! > > Is this an ARM specific decision? Cpufreq certainly supports per cpu scaling > > and x86 udelay uses per-CPU data. So your concern should apply for x86 > > too. I had the same concern and was planning on bring it up in the cpufreq > > mailing list after I made sure I didn't misunderstand anything. > > Well, x86 looks buggy in this regard as well - the loops_per_jiffy > value used is for a CPU which may not run the delay loop. x86 assumes all cores to be essentially same. > > Btw, your concern should apply for single core scaling too, right? Context > > switch can complete within max udelay (general - 5ms, ARM - 2ms) time and > > CPU could have jumped > > from lowest to highest speed in that time and mess up udelay. I didn't see > > any code in cpufreq that deferred scaling during udelay. So, that's something > > I plan to ask cpufreq folks too. > > Well, the assumption is that the CPUs will be running at their fastest > speed at boot time, and therefore loops_per_jiffy will be calibrated > such that we guarantee _at least_ the asked-for delay - which is the > only guarantee udelay has. Well, some machines can't reach max cpu speed on battery power, so there may be a problem there. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2011-01-08 23:24 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-21 2:19 udelay() broken for SMP cores? Saravana Kannan 2010-04-21 4:56 ` Shilimkar, Santosh 2010-04-21 6:43 ` Saravana Kannan 2010-04-21 7:22 ` Russell King - ARM Linux 2010-04-21 9:39 ` skannan at codeaurora.org 2010-04-21 9:50 ` Russell King - ARM Linux 2010-04-21 9:58 ` Gilles Chanteperdrix 2010-04-21 10:00 ` Jamie Lokier 2010-04-21 19:29 ` Russell King - ARM Linux 2010-04-21 19:52 ` Jamie Lokier 2010-04-21 20:21 ` Russell King - ARM Linux 2010-04-21 20:47 ` Jamie Lokier 2010-04-21 20:57 ` Russell King - ARM Linux 2010-04-22 0:14 ` Jamie Lokier 2011-01-08 23:24 ` Russell King - ARM Linux 2010-04-21 10:31 ` skannan at codeaurora.org 2010-04-21 19:33 ` Russell King - ARM Linux 2010-04-21 23:47 ` Saravana Kannan 2010-04-23 9:00 ` Pavel Machek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).