From: linux@arm.linux.org.uk (Russell King - ARM Linux)
To: linux-arm-kernel@lists.infradead.org
Subject: udelay() broken for SMP cores?
Date: Sat, 8 Jan 2011 23:24:27 +0000 [thread overview]
Message-ID: <20110108232427.GA19891@n2100.arm.linux.org.uk> (raw)
In-Reply-To: <20100422001417.GF27575@shareable.org>
On Thu, Apr 22, 2010 at 01:14:17AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > Ok, since you seem to have a clear idea how to convert this into a double
> > nested loop, try converting it:
> >
> > @ 0 <= r0 <= 0x7fffff06
> > ldr r2, .LC0 (loops_per_jiffy)
> > ldr r2, [r2] @ max = 0x01ffffff
> > mov r0, r0, lsr #14 @ max = 0x0001ffff
> > mov r2, r2, lsr #10 @ max = 0x00007fff
> > mul r0, r2, r0 @ max = 2^32-1
> > movs r0, r0, lsr #6
> > moveq pc, lr
> > 1: subs r0, r0, #1
> > bhi 1b
> > mov pc, lr
> >
> > into two loops without losing the precision - note that the multiply
> > is part of a 'dividing by multiply+shift' technique.
>
> ldr r2, loops_per_jiffy
> ldr r3, microseconds_per_jiffy
> mov r4, r2
> 1: subs r4, r4, r3
> bhi 1b
> subs r0, r0, #1
> add r4, r4, r2
> bhi 1b
> mov pc, lr
>
> Goodnight :)
I thought I'd dig this out and give it a go - but it has problems. Let's
say usec_per_jiffy is 10000. Initially, loops_per_jiffy is 1<<12 or 4096
at boot.
If udelay() is used prior to calibration (it is - see things like OMAP/8250
console drivers which use udelay(1)), the initial loops_per_jiffy value
will be used.
So, r0 = 10000. r3 = 10000. r2 = 4096.
mov r4, r2 @ r4 := 4096
1: subs r4, r4, r3 @ r4 -= 10000 := -5904
bhi 1b @ not taken
subs r0, r0, #1 @ r0 -= 1 := 9999
add r4, r4, r2 @ r4 += 4096 := -1808 (or 4294965488)
bhi 1b @ taken
That's the first iteration. The next iteration:
1: subs r4, r4, r3 @ r4 -= 10000 := 4294955488
bhi 1b @ taken
1: subs r4, r4, r3 @ r4 -= 10000 := 4294945488
bhi 1b @ taken
... which means we have about 429493 loops to go ...
So this becomes an extremely slow loop - it works when loops_per_jiffy >
usec_per_jiffy.
Even with a value of 8192 (the first tried lpj in the calibration loop),
things eventually go wrong - r4 on each iteration goes -1808, -3616, ..
-9040 and then we're into the problem above - and this will be the case
for anyone with HZ=100.
So, this solution has undesirable behaviours... and this is what I've
come up with - we manually increase the lpj and decrease the required
delay by a power of two until we meet the necessary preconditions
(lpj >= usec/jiffy).
.LC0: .long loops_per_jiffy
.long (1000000 + (HZ / 2))/HZ
ENTRY(__delay)
ldr r3, .LC0 + 4 @ usec/jiffy
mov r2, r0
mov r0, r3
b 2f
ENTRY(__udelay)
ldr r2, .LC0
ldr r3, .LC0 + 4 @ usec/jiffy
ldr r2, [r2] @ lpj
2: cmp r2, r3
movcc r2, r2, lsl #1
movcc r0, r0, lsr #1
bcc 2b
mov ip, r2
1: subs ip, ip, r3
addls ip, ip, r2
sublss r0, r0, #2
bhi 1b
mov pc, lr
Note that I've also tweaked the loop a little to make the cycle count
(in theory) around the loop the same no matter what it does.
This way, I get the same lpj calibration value as the old way - which is
good as with the old way, we were calibrating just this loop:
1: subs r0, r0, #1
bhi 1b
mov pc, lr
where r0 = lpj and the target delay time was 1 jiffy.
Now, what sparked this off was:
> > We could go to ns delays, but then we have a big problem - the cost of
> > calculating the number of loops starts to become significant compared to
> > the delays - and that's a quality of implementation factor. In fact,
> > the existing cost has always been significant for short delays for
> > slower (sub-100MHz) ARMs.
>
> I'm surprised it makes much difference to, say, 20MHz ARMs because you
> could structure it as a nested loop, the inner one executed once per
> microsecond and calibrated to 1us. The delays don't have to be super
> accurate.
With this nested loop approach we can't go to ns resolution.
nsec_per_jiffy would be 10000000, and with an initial loops_per_jiffy
of 4096 or 8192, this would be extremely bad.
That said, I do think your approach has merit - especially as we're
now seeing CPUs in the 2000 BogoMips range, and our existing solution
goes bad at 3355 BogoMips. As the board I have is something like 8
months old we've probably got what, 10 months left according to Moore's
law?
next prev parent reply other threads:[~2011-01-08 23:24 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-04-21 2:19 udelay() broken for SMP cores? Saravana Kannan
2010-04-21 4:56 ` Shilimkar, Santosh
2010-04-21 6:43 ` Saravana Kannan
2010-04-21 7:22 ` Russell King - ARM Linux
2010-04-21 9:39 ` skannan at codeaurora.org
2010-04-21 9:50 ` Russell King - ARM Linux
2010-04-21 9:58 ` Gilles Chanteperdrix
2010-04-21 10:00 ` Jamie Lokier
2010-04-21 19:29 ` Russell King - ARM Linux
2010-04-21 19:52 ` Jamie Lokier
2010-04-21 20:21 ` Russell King - ARM Linux
2010-04-21 20:47 ` Jamie Lokier
2010-04-21 20:57 ` Russell King - ARM Linux
2010-04-22 0:14 ` Jamie Lokier
2011-01-08 23:24 ` Russell King - ARM Linux [this message]
2010-04-21 10:31 ` skannan at codeaurora.org
2010-04-21 19:33 ` Russell King - ARM Linux
2010-04-21 23:47 ` Saravana Kannan
2010-04-23 9:00 ` Pavel Machek
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110108232427.GA19891@n2100.arm.linux.org.uk \
--to=linux@arm.linux.org.uk \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).