* [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) @ 2010-09-28 3:33 Stephen Boyd 2010-09-28 3:33 ` [PATCH 1/3] [ARM] Translate delay.S into (mostly) C Stephen Boyd ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Stephen Boyd @ 2010-09-28 3:33 UTC (permalink / raw) To: linux-arm-kernel These patches are another attempt at fixing the udelay() issue pointed out on arm-lkml[1][2]. A quick recap: some SMP machines can scale their CPU frequencies independent of one another. loops_per_jiffy is calibrated globally and used in __const_udelay(). If one CPU is running faster than what the loops_per_jiffy is calculated (or scaled) for, udelay() will be incorrect and not wait long enough (or too long). A similar problem occurs if the cpu frequency is scaled during a udelay() call. We could fix this issue a couple ways, wholesale replacement of __udelay() and __const_udelay() (see [2] for that approach), or replacement of __delay() (this series). Option 1 can fail if anybody uses udelay() before memory is mapped and also duplicates most of the code in asm/delay.h. It also needs to hardcode the timer tick frequency, which can sometimes be inaccurate. The benefit is that loops_per_jiffy stays the same and thus BogoMIPS is unchanged. Option 2 can't fail since the __delay() loop is replaced after memory is mapped in, but it suffers from a low BogoMIPS when timers are clocked slowly. It also more accurately calculates the timer tick frequency through the use of calibrate_delay_direct(). -- Reference -- [1] http://article.gmane.org/gmane.linux.kernel/977567 [2] http://article.gmane.org/gmane.linux.ports.arm.kernel/78496 Stephen Boyd (3): [ARM] Translate delay.S into (mostly) C [ARM] Allow machines to override __delay() [ARM] Implement a timer based __delay() loop arch/arm/include/asm/delay.h | 5 ++- arch/arm/kernel/armksyms.c | 4 -- arch/arm/lib/delay.S | 65 ------------------------------ arch/arm/lib/delay.c | 90 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 94 insertions(+), 70 deletions(-) delete mode 100644 arch/arm/lib/delay.S create mode 100644 arch/arm/lib/delay.c -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-09-28 3:33 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd @ 2010-09-28 3:33 ` Stephen Boyd 2010-10-05 17:22 ` Daniel Walker 2010-09-28 3:33 ` [PATCH 2/3] [ARM] Allow machines to override __delay() Stephen Boyd 2010-09-28 3:33 ` [PATCH 3/3] [ARM] Implement a timer based __delay() loop Stephen Boyd 2 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-09-28 3:33 UTC (permalink / raw) To: linux-arm-kernel In the next patch we're going to allow machines to override the __delay() implementation at runtime so they can implement a timer based __delay() routine. It's easier to do this using C, so lets write udelay and friends in C. We lose the #if 0 code, which according to Russell is used "to make the delay loop more stable and predictable on older CPUs" (see http://article.gmane.org/gmane.linux.kernel/888867 for more info). We shouldn't be too worried though, since the next patch adds functionality to allow a machine to set the __delay() loop themselves, therefore allowing machines to resurrect the commented out code if they need it. bloat-o-meter shows an increase of 12 bytes. Further inspection of the assembly shows GCC copying the loops_per_jiffy pointer and the magic HZ value to the ends of __const_udelay() and _delay() thus contributing an extra 4 and 8 bytes of data to each function. These two values weren't taken into account in the delay.S version since they weren't part of the function in nm's eyes. This means we only really gained an extra 4 bytes due to GCC's decision to duplicate the loops_per_jiffy pointer in __const_udelay. $ scripts/bloat-o-meter vmlinux.orig vmlinux.new add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) function old new delta __udelay 48 56 +8 __const_udelay 40 44 +4 Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Saravana Kannan <skannan@codeaurora.org> --- So maybe we can make the magic HZ value into a #define? UDELAY_SCALAR? UDELAY_MAGIC? arch/arm/include/asm/delay.h | 2 +- arch/arm/kernel/armksyms.c | 4 -- arch/arm/lib/delay.S | 65 ------------------------------------------ arch/arm/lib/delay.c | 57 ++++++++++++++++++++++++++++++++++++ 4 files changed, 58 insertions(+), 70 deletions(-) delete mode 100644 arch/arm/lib/delay.S create mode 100644 arch/arm/lib/delay.c diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h index b2deda1..ccc5ed5 100644 --- a/arch/arm/include/asm/delay.h +++ b/arch/arm/include/asm/delay.h @@ -8,7 +8,7 @@ #include <asm/param.h> /* HZ */ -extern void __delay(int loops); +extern void __delay(unsigned long loops); /* * This function intentionally does not exist; if you see references to diff --git a/arch/arm/kernel/armksyms.c b/arch/arm/kernel/armksyms.c index 8214bfe..259e549 100644 --- a/arch/arm/kernel/armksyms.c +++ b/arch/arm/kernel/armksyms.c @@ -52,10 +52,6 @@ extern void fpundefinstr(void); EXPORT_SYMBOL(__backtrace); - /* platform dependent support */ -EXPORT_SYMBOL(__udelay); -EXPORT_SYMBOL(__const_udelay); - /* networking */ EXPORT_SYMBOL(csum_partial); EXPORT_SYMBOL(csum_partial_copy_from_user); diff --git a/arch/arm/lib/delay.S b/arch/arm/lib/delay.S deleted file mode 100644 index 8d6a876..0000000 --- a/arch/arm/lib/delay.S +++ /dev/null @@ -1,65 +0,0 @@ -/* - * linux/arch/arm/lib/delay.S - * - * Copyright (C) 1995, 1996 Russell King - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - */ -#include <linux/linkage.h> -#include <asm/assembler.h> -#include <asm/param.h> - .text - -.LC0: .word loops_per_jiffy -.LC1: .word (2199023*HZ)>>11 - -/* - * r0 <= 2000 - * lpj <= 0x01ffffff (max. 3355 bogomips) - * HZ <= 1000 - */ - -ENTRY(__udelay) - ldr r2, .LC1 - mul r0, r2, r0 -ENTRY(__const_udelay) @ 0 <= r0 <= 0x7fffff06 - ldr r2, .LC0 - ldr r2, [r2] @ max = 0x01ffffff - mov r0, r0, lsr #14 @ max = 0x0001ffff - mov r2, r2, lsr #10 @ max = 0x00007fff - mul r0, r2, r0 @ max = 2^32-1 - movs r0, r0, lsr #6 - moveq pc, lr - -/* - * loops = r0 * HZ * loops_per_jiffy / 1000000 - * - * Oh, if only we had a cycle counter... - */ - -@ Delay routine -ENTRY(__delay) - subs r0, r0, #1 -#if 0 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 - movls pc, lr - subs r0, r0, #1 -#endif - bhi __delay - mov pc, lr -ENDPROC(__udelay) -ENDPROC(__const_udelay) -ENDPROC(__delay) diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c new file mode 100644 index 0000000..5ee0adc --- /dev/null +++ b/arch/arm/lib/delay.c @@ -0,0 +1,57 @@ +/* + * Originally from linux/arch/arm/lib/delay.S + * + * Copyright (C) 1995, 1996 Russell King + * Copyright (c) 2010, Code Aurora Forum. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ +#include <linux/module.h> +#include <linux/delay.h> + +/* + * loops = usecs * HZ * loops_per_jiffy / 1000000 + * + * Oh, if only we had a cycle counter... + */ +void __delay(unsigned long loops) +{ + asm volatile( + "1: subs %0, %0, #1 \n" + " bhi 1b \n" + : /* No output */ + : "r" (loops) + ); +} +EXPORT_SYMBOL(__delay); + +/* + * 0 <= xloops <= 0x7fffff06 + * loops_per_jiffy <= 0x01ffffff (max. 3355 bogomips) + */ +void __const_udelay(unsigned long xloops) +{ + unsigned long lpj; + unsigned long loops; + + xloops >>= 14; /* max = 0x01ffffff */ + lpj = loops_per_jiffy >> 10; /* max = 0x0001ffff */ + loops = lpj * xloops; /* max = 0x00007fff */ + loops >>= 6; /* max = 2^32-1 */ + + if (loops) + __delay(loops); +} +EXPORT_SYMBOL(__const_udelay); + +/* + * usecs <= 2000 + * HZ <= 1000 + */ +void __udelay(unsigned long usecs) +{ + __const_udelay(usecs * ((2199023*HZ)>>11)); +} +EXPORT_SYMBOL(__udelay); -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-09-28 3:33 ` [PATCH 1/3] [ARM] Translate delay.S into (mostly) C Stephen Boyd @ 2010-10-05 17:22 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 0 siblings, 1 reply; 22+ messages in thread From: Daniel Walker @ 2010-10-05 17:22 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2010-09-27 at 20:33 -0700, Stephen Boyd wrote: > In the next patch we're going to allow machines to override the > __delay() implementation at runtime so they can implement a timer > based __delay() routine. It's easier to do this using C, so lets > write udelay and friends in C. You shouldn't really reference this series in this comment. You have to look at this as a changeset comment. So you really want to describe what this change is doing not what the over all series is doing. It would be better to say something like, "We're implementing this in C to give us further flexibility in allowing overrides." But don't references "next patch" or "this series" , but you can do that in the intro email. > We lose the #if 0 code, which according to Russell is used "to > make the delay loop more stable and predictable on older CPUs" > (see http://article.gmane.org/gmane.linux.kernel/888867 for more > info). We shouldn't be too worried though, since the next patch > adds functionality to allow a machine to set the __delay() loop > themselves, therefore allowing machines to resurrect the > commented out code if they need it. > > bloat-o-meter shows an increase of 12 bytes. Further inspection > of the assembly shows GCC copying the loops_per_jiffy pointer and > the magic HZ value to the ends of __const_udelay() and _delay() > thus contributing an extra 4 and 8 bytes of data to each > function. These two values weren't taken into account in the > delay.S version since they weren't part of the function in nm's > eyes. This means we only really gained an extra 4 bytes due to > GCC's decision to duplicate the loops_per_jiffy pointer in > __const_udelay. > > $ scripts/bloat-o-meter vmlinux.orig vmlinux.new > add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) > function old new delta > __udelay 48 56 +8 > __const_udelay 40 44 +4 I think the "size" command might be better for this type of stuff. It should give you the same output (or similar).. It's just used more frequently. Is this an -Os kernel, or -O2 ? > Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> > Reviewed-by: Saravana Kannan <skannan@codeaurora.org> > --- > > So maybe we can make the magic HZ value into a #define? UDELAY_SCALAR? > UDELAY_MAGIC? > > arch/arm/include/asm/delay.h | 2 +- > arch/arm/kernel/armksyms.c | 4 -- > arch/arm/lib/delay.S | 65 ------------------------------------------ > arch/arm/lib/delay.c | 57 ++++++++++++++++++++++++++++++++++++ > 4 files changed, 58 insertions(+), 70 deletions(-) > delete mode 100644 arch/arm/lib/delay.S > create mode 100644 arch/arm/lib/delay.c > > diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h > index b2deda1..ccc5ed5 100644 > --- a/arch/arm/include/asm/delay.h > +++ b/arch/arm/include/asm/delay.h > @@ -8,7 +8,7 @@ > > #include <asm/param.h> /* HZ */ > > -extern void __delay(int loops); > +extern void __delay(unsigned long loops); > > /* > * This function intentionally does not exist; if you see references to > diff --git a/arch/arm/kernel/armksyms.c b/arch/arm/kernel/armksyms.c > index 8214bfe..259e549 100644 > --- a/arch/arm/kernel/armksyms.c > +++ b/arch/arm/kernel/armksyms.c > @@ -52,10 +52,6 @@ extern void fpundefinstr(void); > > EXPORT_SYMBOL(__backtrace); > > - /* platform dependent support */ > -EXPORT_SYMBOL(__udelay); > -EXPORT_SYMBOL(__const_udelay); > - > /* networking */ > EXPORT_SYMBOL(csum_partial); > EXPORT_SYMBOL(csum_partial_copy_from_user); > diff --git a/arch/arm/lib/delay.S b/arch/arm/lib/delay.S > deleted file mode 100644 > index 8d6a876..0000000 > --- a/arch/arm/lib/delay.S > +++ /dev/null > @@ -1,65 +0,0 @@ > -/* > - * linux/arch/arm/lib/delay.S > - * > - * Copyright (C) 1995, 1996 Russell King > - * > - * This program is free software; you can redistribute it and/or modify > - * it under the terms of the GNU General Public License version 2 as > - * published by the Free Software Foundation. > - */ > -#include <linux/linkage.h> > -#include <asm/assembler.h> > -#include <asm/param.h> > - .text > - > -.LC0: .word loops_per_jiffy > -.LC1: .word (2199023*HZ)>>11 > - > -/* > - * r0 <= 2000 > - * lpj <= 0x01ffffff (max. 3355 bogomips) > - * HZ <= 1000 > - */ > - > -ENTRY(__udelay) > - ldr r2, .LC1 > - mul r0, r2, r0 > -ENTRY(__const_udelay) @ 0 <= r0 <= 0x7fffff06 > - ldr r2, .LC0 > - ldr r2, [r2] @ max = 0x01ffffff > - mov r0, r0, lsr #14 @ max = 0x0001ffff > - mov r2, r2, lsr #10 @ max = 0x00007fff > - mul r0, r2, r0 @ max = 2^32-1 > - movs r0, r0, lsr #6 > - moveq pc, lr > - > -/* > - * loops = r0 * HZ * loops_per_jiffy / 1000000 > - * > - * Oh, if only we had a cycle counter... > - */ > - > -@ Delay routine > -ENTRY(__delay) > - subs r0, r0, #1 > -#if 0 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > - movls pc, lr > - subs r0, r0, #1 > -#endif > - bhi __delay > - mov pc, lr > -ENDPROC(__udelay) > -ENDPROC(__const_udelay) > -ENDPROC(__delay) > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c > new file mode 100644 > index 0000000..5ee0adc > --- /dev/null > +++ b/arch/arm/lib/delay.c > @@ -0,0 +1,57 @@ > +/* > + * Originally from linux/arch/arm/lib/delay.S > + * > + * Copyright (C) 1995, 1996 Russell King > + * Copyright (c) 2010, Code Aurora Forum. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > +#include <linux/module.h> > +#include <linux/delay.h> > + > +/* > + * loops = usecs * HZ * loops_per_jiffy / 1000000 > + * > + * Oh, if only we had a cycle counter... > + */ > +void __delay(unsigned long loops) > +{ > + asm volatile( > + "1: subs %0, %0, #1 \n" > + " bhi 1b \n" > + : /* No output */ > + : "r" (loops) > + ); > +} > +EXPORT_SYMBOL(__delay); > + > +/* > + * 0 <= xloops <= 0x7fffff06 > + * loops_per_jiffy <= 0x01ffffff (max. 3355 bogomips) As long as your doing a re-write may as well add some real text here, and for the others. > + */ > +void __const_udelay(unsigned long xloops) > +{ > + unsigned long lpj; > + unsigned long loops; > + > + xloops >>= 14; /* max = 0x01ffffff */ > + lpj = loops_per_jiffy >> 10; /* max = 0x0001ffff */ > + loops = lpj * xloops; /* max = 0x00007fff */ > + loops >>= 6; /* max = 2^32-1 */ > + > + if (loops) > + __delay(loops); likely/unlikely ? > +} > +EXPORT_SYMBOL(__const_udelay); > + > +/* > + * usecs <= 2000 > + * HZ <= 1000 > + */ > +void __udelay(unsigned long usecs) > +{ > + __const_udelay(usecs * ((2199023*HZ)>>11)); > +} > +EXPORT_SYMBOL(__udelay); -- Sent by a consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-05 17:22 ` Daniel Walker @ 2010-10-06 3:36 ` Stephen Boyd 2010-10-06 13:38 ` Daniel Walker 2010-10-06 14:26 ` Nicolas Pitre 0 siblings, 2 replies; 22+ messages in thread From: Stephen Boyd @ 2010-10-06 3:36 UTC (permalink / raw) To: linux-arm-kernel On 10/05/2010 10:22 AM, Daniel Walker wrote: > You shouldn't really reference this series in this comment. You have to > look at this as a changeset comment. So you really want to describe what > this change is doing not what the over all series is doing. > > It would be better to say something like, > > "We're implementing this in C to give us further flexibility in allowing > overrides." > > But don't references "next patch" or "this series" , but you can do that > in the intro email. > Why not? This is a common thing to do when multiple patches are related to one topic. Doing a git log and searching for "next patch" shows others doing the same. >> >> $ scripts/bloat-o-meter vmlinux.orig vmlinux.new >> add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) >> function old new delta >> __udelay 48 56 +8 >> __const_udelay 40 44 +4 > > I think the "size" command might be better for this type of stuff. It > should give you the same output (or similar).. It's just used more > frequently. Ok. $ size vmlinux.orig text data bss dec hex filename 6533503 530232 1228296 8292031 7e86bf vmlinux.orig $ size vmlinux.new text data bss dec hex filename 6533607 530232 1228296 8292135 7e8727 vmlinux.new I should mention GCC decided to inline __delay() into __udelay() and __const_udelay() and also decided to inline __const_udelay() into __delay() meaning we lost the function interleaving the assembly file had. I'll make a note of that and sorry for being misleading. > Is this an -Os kernel, or -O2 ? -Os. >> +/* >> + * 0 <= xloops <= 0x7fffff06 >> + * loops_per_jiffy <= 0x01ffffff (max. 3355 bogomips) > > As long as your doing a re-write may as well add some real text here, > and for the others. Any suggestions on what to add? I hoped the original comments would be good enough already considering the code is unchanged. >> + */ >> +void __const_udelay(unsigned long xloops) >> +{ >> + unsigned long lpj; >> + unsigned long loops; >> + >> + xloops >>= 14; /* max = 0x01ffffff */ >> + lpj = loops_per_jiffy >> 10; /* max = 0x0001ffff */ >> + loops = lpj * xloops; /* max = 0x00007fff */ >> + loops >>= 6; /* max = 2^32-1 */ >> + >> + if (loops) >> + __delay(loops); > > likely/unlikely ? likely. Although I'm doubtful on how much it will help considering the assembly and the code are equivalent already for this part of the code. -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 3:36 ` Stephen Boyd @ 2010-10-06 13:38 ` Daniel Walker 2010-10-06 14:26 ` Nicolas Pitre 1 sibling, 0 replies; 22+ messages in thread From: Daniel Walker @ 2010-10-06 13:38 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2010-10-05 at 20:36 -0700, Stephen Boyd wrote: > On 10/05/2010 10:22 AM, Daniel Walker wrote: > > You shouldn't really reference this series in this comment. You have to > > look at this as a changeset comment. So you really want to describe what > > this change is doing not what the over all series is doing. > > > > It would be better to say something like, > > > > "We're implementing this in C to give us further flexibility in allowing > > overrides." > > > > But don't references "next patch" or "this series" , but you can do that > > in the intro email. > > > > Why not? This is a common thing to do when multiple patches are related > to one topic. Doing a git log and searching for "next patch" shows > others doing the same. I would not say it's common .. I don't recall seeing many other series which do this. You shouldn't do it because the patch description doesn't stand on it own. What if this patch is accepted in isolation, and the others rejected would your description make sense? Also people don't always look at a "series" in git history, sometimes you only look at one commit so saying "next patch" or "this series" can be confusing. You can have an intro email where you can describe the whole series, including what your intending for each patch to do. > >> > >> $ scripts/bloat-o-meter vmlinux.orig vmlinux.new > >> add/remove: 0/0 grow/shrink: 2/0 up/down: 12/0 (12) > >> function old new delta > >> __udelay 48 56 +8 > >> __const_udelay 40 44 +4 > > > > I think the "size" command might be better for this type of stuff. It > > should give you the same output (or similar).. It's just used more > > frequently. > > Ok. > > $ size vmlinux.orig > text data bss dec hex filename > 6533503 530232 1228296 8292031 7e86bf vmlinux.orig > $ size vmlinux.new > text data bss dec hex filename > 6533607 530232 1228296 8292135 7e8727 vmlinux.new > > I should mention GCC decided to inline __delay() into __udelay() and > __const_udelay() and also decided to inline __const_udelay() into > __delay() meaning we lost the function interleaving the assembly file > had. I'll make a note of that and sorry for being misleading. > > > Is this an -Os kernel, or -O2 ? > > -Os. > > >> +/* > >> + * 0 <= xloops <= 0x7fffff06 > >> + * loops_per_jiffy <= 0x01ffffff (max. 3355 bogomips) > > > > As long as your doing a re-write may as well add some real text here, > > and for the others. > > Any suggestions on what to add? I hoped the original comments would be > good enough already considering the code is unchanged. you might want to say what the purpose of the function is .. > >> + */ > >> +void __const_udelay(unsigned long xloops) > >> +{ > >> + unsigned long lpj; > >> + unsigned long loops; > >> + > >> + xloops >>= 14; /* max = 0x01ffffff */ > >> + lpj = loops_per_jiffy >> 10; /* max = 0x0001ffff */ > >> + loops = lpj * xloops; /* max = 0x00007fff */ > >> + loops >>= 6; /* max = 2^32-1 */ > >> + > >> + if (loops) > >> + __delay(loops); > > > > likely/unlikely ? > > likely. Although I'm doubtful on how much it will help considering the > assembly and the code are equivalent already for this part of the code. I don't know, shouldn't hurt tho. Daniel -- Sent by an consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 3:36 ` Stephen Boyd 2010-10-06 13:38 ` Daniel Walker @ 2010-10-06 14:26 ` Nicolas Pitre 2010-10-06 18:30 ` Stephen Boyd 1 sibling, 1 reply; 22+ messages in thread From: Nicolas Pitre @ 2010-10-06 14:26 UTC (permalink / raw) To: linux-arm-kernel On Tue, 5 Oct 2010, Stephen Boyd wrote: > On 10/05/2010 10:22 AM, Daniel Walker wrote: > > I think the "size" command might be better for this type of stuff. It > > should give you the same output (or similar).. It's just used more > > frequently. > > Ok. > > $ size vmlinux.orig > text data bss dec hex filename > 6533503 530232 1228296 8292031 7e86bf vmlinux.orig > $ size vmlinux.new > text data bss dec hex filename > 6533607 530232 1228296 8292135 7e8727 vmlinux.new If you modified only one source file then I'd suggest you run 'size' only on the affected .o file instead. Nicolas ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 14:26 ` Nicolas Pitre @ 2010-10-06 18:30 ` Stephen Boyd 2010-10-06 19:35 ` Daniel Walker 0 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-10-06 18:30 UTC (permalink / raw) To: linux-arm-kernel On 10/06/2010 07:26 AM, Nicolas Pitre wrote: >> Ok. >> >> $ size vmlinux.orig >> text data bss dec hex filename >> 6533503 530232 1228296 8292031 7e86bf vmlinux.orig >> $ size vmlinux.new >> text data bss dec hex filename >> 6533607 530232 1228296 8292135 7e8727 vmlinux.new > > If you modified only one source file then I'd suggest you run 'size' > only on the affected .o file instead. Ok. $ size delay.o.orig text data bss dec hex filename 56 0 0 56 38 delay.o.orig $ size delay.o.new text data bss dec hex filename 192 0 0 192 c0 delay.o.new Perhaps I should mark __delay and __const_udelay as noinline? $ size delay.o.noinline text data bss dec hex filename 144 0 0 144 90 delay.o.noinline Now we get backwards branching. $ objdump -d delay.o.noinline Disassembly of section .text: 00000000 <__delay>: 0: e2500001 subs r0, r0, #1 ; 0x1 4: 8afffffd bhi 0 <__delay> 8: e12fff1e bx lr 0000000c <__const_udelay>: c: e59f3018 ldr r3, [pc, #24] ; 2c <__const_udelay+0x20> 10: e1a00720 lsr r0, r0, #14 14: e5933000 ldr r3, [r3] 18: e1a03523 lsr r3, r3, #10 1c: e0000093 mul r0, r3, r0 20: e1b00320 lsrs r0, r0, #6 24: 012fff1e bxeq lr 28: eafffffe b 0 <__delay> 2c: 00000000 .word 0x00000000 00000030 <__udelay>: 30: e59f3004 ldr r3, [pc, #4] ; 3c <__udelay+0xc> 34: e0000093 mul r0, r3, r0 38: eafffffe b c <__const_udelay> 3c: 0001a36e .word 0x0001a36e Is there some way to force GCC to do what I want (interleave the functions)? It seems happy to inline them and then optimize the register usage and instruction ordering. Perhaps that is OK though and we're wasting our time trying to be conservative in code size. $ objdump -d delay.o.new Disassembly of section .text: 00000000 <__delay>: 0: e2500001 subs r0, r0, #1 ; 0x1 4: 8afffffd bhi 0 <__delay> 8: e12fff1e bx lr 0000000c <__const_udelay>: c: e59f3020 ldr r3, [pc, #32] ; 34 <__const_udelay+0x28> 10: e1a00720 lsr r0, r0, #14 14: e5933000 ldr r3, [r3] 18: e1a03523 lsr r3, r3, #10 1c: e0000093 mul r0, r3, r0 20: e1b00320 lsrs r0, r0, #6 24: 012fff1e bxeq lr 28: e2500001 subs r0, r0, #1 ; 0x1 2c: 8afffffd bhi 28 <__const_udelay+0x1c> 30: e12fff1e bx lr 34: 00000000 .word 0x00000000 00000038 <__udelay>: 38: e59f3028 ldr r3, [pc, #40] ; 68 <__udelay+0x30> 3c: e59f2028 ldr r2, [pc, #40] ; 6c <__udelay+0x34> 40: e0030093 mul r3, r3, r0 44: e5922000 ldr r2, [r2] 48: e1a02522 lsr r2, r2, #10 4c: e1a03723 lsr r3, r3, #14 50: e0030392 mul r3, r2, r3 54: e1b03323 lsrs r3, r3, #6 58: 012fff1e bxeq lr 5c: e2533001 subs r3, r3, #1 ; 0x1 60: 8afffffd bhi 5c <__udelay+0x24> 64: e12fff1e bx lr 68: 0001a36e .word 0x0001a36e 6c: 00000000 .word 0x00000000 -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 18:30 ` Stephen Boyd @ 2010-10-06 19:35 ` Daniel Walker 2010-10-06 20:05 ` Nicolas Pitre 2010-10-06 20:24 ` Stephen Boyd 0 siblings, 2 replies; 22+ messages in thread From: Daniel Walker @ 2010-10-06 19:35 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote: > Is there some way to force GCC to do what I want (interleave the > functions)? It seems happy to inline them and then optimize the > register > usage and instruction ordering. Perhaps that is OK though and we're > wasting our time trying to be conservative in code size. Is it possible to do all this in assembly ? Can't you have the default implementation using this assembly with different function names, then just set the assembly function names in C code someplace? Daniel -- Sent by an consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 19:35 ` Daniel Walker @ 2010-10-06 20:05 ` Nicolas Pitre 2010-10-08 0:11 ` Stephen Boyd 2010-10-06 20:24 ` Stephen Boyd 1 sibling, 1 reply; 22+ messages in thread From: Nicolas Pitre @ 2010-10-06 20:05 UTC (permalink / raw) To: linux-arm-kernel On Wed, 6 Oct 2010, Daniel Walker wrote: > On Wed, 2010-10-06 at 11:30 -0700, Stephen Boyd wrote: > > Is there some way to force GCC to do what I want (interleave the > > functions)? It seems happy to inline them and then optimize the > > register > > usage and instruction ordering. Perhaps that is OK though and we're > > wasting our time trying to be conservative in code size. You could use the noinline qualifier from <linux/compiler.h> with those functions you don't want inlined. > Is it possible to do all this in assembly ? Can't you have the default > implementation using this assembly with different function names, then > just set the assembly function names in C code someplace? That weould be my preference too. Being in assembly means that this code is unlikely to change with different optimization levels and/or gcc versions which would otherwise require different calibration values. Relying on stable calibration is necessary for the lpj kernel cmdline parameter to have some meaning. Nicolas ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 20:05 ` Nicolas Pitre @ 2010-10-08 0:11 ` Stephen Boyd 2010-10-08 1:12 ` Nicolas Pitre 0 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-10-08 0:11 UTC (permalink / raw) To: linux-arm-kernel On 10/06/2010 01:05 PM, Nicolas Pitre wrote: > You could use the noinline qualifier from <linux/compiler.h> with those > functions you don't want inlined. > That won't help me for the interleaving behavior though. >> Is it possible to do all this in assembly ? Can't you have the default >> implementation using this assembly with different function names, then >> just set the assembly function names in C code someplace? > That weould be my preference too. Being in assembly means that this > code is unlikely to change with different optimization levels and/or gcc > versions which would otherwise require different calibration values. > Relying on stable calibration is necessary for the lpj kernel cmdline > parameter to have some meaning. Why doesn't any other architecture use assembly for their lpj code? They may use headers with assembly in them or C code with assembly in them, but they don't write all of the delay code in assembly and rely on function interleaving. This leads me to believe other arches aren't concerned about compiler optimizations breaking lpj cmdline parameters, so why should ARM be concerned? I tested the theory out and scaled down the CPU frequency to 19.2 MHz and then called calibrate_delay(). Before and after applying this series I got the same results. Calibrating delay loop... 12.67 BogoMIPS (lpj=63360) Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same before and after Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928) I don't have access to a machine capable of running slower than 19.2 MHz. Maybe machines running in the KHz range would experience differences? -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-08 0:11 ` Stephen Boyd @ 2010-10-08 1:12 ` Nicolas Pitre 0 siblings, 0 replies; 22+ messages in thread From: Nicolas Pitre @ 2010-10-08 1:12 UTC (permalink / raw) To: linux-arm-kernel On Thu, 7 Oct 2010, Stephen Boyd wrote: > Why doesn't any other architecture use assembly for their lpj code? They > may use headers with assembly in them or C code with assembly in them, > but they don't write all of the delay code in assembly and rely on > function interleaving. This leads me to believe other arches aren't > concerned about compiler optimizations breaking lpj cmdline parameters, > so why should ARM be concerned? > > I tested the theory out and scaled down the CPU frequency to 19.2 MHz > and then called calibrate_delay(). Before and after applying this series > I got the same results. > > Calibrating delay loop... 12.67 BogoMIPS (lpj=63360) > > Jumping up to 1.2 GHz and calling calibrate_delay() gives me the same > before and after > > Calibrating delay loop... 792.98 BogoMIPS (lpj=3964928) OK, fair enough. > I don't have access to a machine capable of running slower than 19.2 > MHz. Maybe machines running in the KHz range would experience differences? Don't worry, I doubt any ARM processor capable of running Linux ever was that slow. Nicolas ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/3] [ARM] Translate delay.S into (mostly) C 2010-10-06 19:35 ` Daniel Walker 2010-10-06 20:05 ` Nicolas Pitre @ 2010-10-06 20:24 ` Stephen Boyd 1 sibling, 0 replies; 22+ messages in thread From: Stephen Boyd @ 2010-10-06 20:24 UTC (permalink / raw) To: linux-arm-kernel On 10/06/2010 12:35 PM, Daniel Walker wrote: > Is it possible to do all this in assembly ? Can't you have the default > implementation using this assembly with different function names, then > just set the assembly function names in C code someplace? Sure we could do that. I went this route because adding the timer based delay code was a copy paste instead of a copy translate. Actually, after adding the set_delay_fn code __const_udelay and __delay aren't inlined into __udelay anymore so we're back to the noinline behavior except we're missing interleaving. Finally, I thought it would be clearer what was going on if it was in C as opposed to assembly. How bad is a branch as opposed to fall through. And more importantly, how bad is a push/pop? 00000000 <delay_loop>: 0: e2500001 subs r0, r0, #1 ; 0x1 4: 8afffffd bhi 0 <delay_loop> 8: e12fff1e bx lr 0000000c <set_delay_fn>: c: e59f3004 ldr r3, [pc, #4] ; 18 <set_delay_fn+0xc> 10: e5830000 str r0, [r3] 14: e12fff1e bx lr 18: 00000000 .word 0x00000000 0000001c <__delay>: 1c: e92d4010 push {r4, lr} 20: e59f3008 ldr r3, [pc, #8] ; 30 <__delay+0x14> 24: e1a0e00f mov lr, pc 28: e593f000 ldr pc, [r3] 2c: e8bd8010 pop {r4, pc} 30: 00000000 .word 0x00000000 00000034 <__const_udelay>: 34: e59f3018 ldr r3, [pc, #24] ; 54 <__const_udelay+0x20> 38: e1a00720 lsr r0, r0, #14 3c: e5933000 ldr r3, [r3] 40: e1a03523 lsr r3, r3, #10 44: e0000093 mul r0, r3, r0 48: e1b00320 lsrs r0, r0, #6 4c: 012fff1e bxeq lr 50: eafffffe b 1c <__delay> 54: 00000000 .word 0x00000000 00000058 <__udelay>: 58: e59f3004 ldr r3, [pc, #4] ; 64 <__udelay+0xc> 5c: e0000093 mul r0, r3, r0 60: eafffffe b 34 <__const_udelay> 64: 0001a36e .word 0x0001a36e -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/3] [ARM] Allow machines to override __delay() 2010-09-28 3:33 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd 2010-09-28 3:33 ` [PATCH 1/3] [ARM] Translate delay.S into (mostly) C Stephen Boyd @ 2010-09-28 3:33 ` Stephen Boyd 2010-10-05 17:29 ` Daniel Walker 2010-09-28 3:33 ` [PATCH 3/3] [ARM] Implement a timer based __delay() loop Stephen Boyd 2 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-09-28 3:33 UTC (permalink / raw) To: linux-arm-kernel Some machines want to implement their own __delay() routine based on fixed timers. Expose functionality to set the __delay() routine at runtime. This should allow two machines with different __delay() routines to happily co-exist within the same kernel with minimal overhead. Russell expressed concern that using a timer based __delay() would cause problems where an iomapped device isn't mapped in before a delay call was made (see http://article.gmane.org/gmane.linux.ports.arm.kernel/78543 for more info). We can sidestep that issue with this approach since the __delay() routine _should_ only be pointed to a timer based delay once the timer has been properly mapped. Up until that point __delay() and udelay() will use delay_loop() which is always safe to call. This patch is inspired by x86's delay.c Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Saravana Kannan <skannan@codeaurora.org> --- arch/arm/include/asm/delay.h | 2 ++ arch/arm/lib/delay.c | 21 ++++++++++++++++++--- 2 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h index ccc5ed5..7c732b5 100644 --- a/arch/arm/include/asm/delay.h +++ b/arch/arm/include/asm/delay.h @@ -40,5 +40,7 @@ extern void __const_udelay(unsigned long); __const_udelay((n) * ((2199023U*HZ)>>11))) : \ __udelay(n)) +extern void set_delay_fn(void (*fn)(unsigned long)); + #endif /* defined(_ARM_DELAY_H) */ diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c index 5ee0adc..b307fcc 100644 --- a/arch/arm/lib/delay.c +++ b/arch/arm/lib/delay.c @@ -3,6 +3,8 @@ * * Copyright (C) 1995, 1996 Russell King * Copyright (c) 2010, Code Aurora Forum. All rights reserved. + * Copyright (C) 1993 Linus Torvalds + * Copyright (C) 1997 Martin Mares <mj@atrey.karlin.mff.cuni.cz> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -12,11 +14,9 @@ #include <linux/delay.h> /* - * loops = usecs * HZ * loops_per_jiffy / 1000000 - * * Oh, if only we had a cycle counter... */ -void __delay(unsigned long loops) +void delay_loop(unsigned long loops) { asm volatile( "1: subs %0, %0, #1 \n" @@ -25,6 +25,21 @@ void __delay(unsigned long loops) : "r" (loops) ); } + +static void (*delay_fn)(unsigned long) = delay_loop; + +void set_delay_fn(void (*fn)(unsigned long)) +{ + delay_fn = fn; +} + +/* + * loops = usecs * HZ * loops_per_jiffy / 1000000 + */ +void __delay(unsigned long loops) +{ + delay_fn(loops); +} EXPORT_SYMBOL(__delay); /* -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 2/3] [ARM] Allow machines to override __delay() 2010-09-28 3:33 ` [PATCH 2/3] [ARM] Allow machines to override __delay() Stephen Boyd @ 2010-10-05 17:29 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 0 siblings, 1 reply; 22+ messages in thread From: Daniel Walker @ 2010-10-05 17:29 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2010-09-27 at 20:33 -0700, Stephen Boyd wrote: > Some machines want to implement their own __delay() routine based > on fixed timers. Expose functionality to set the __delay() > routine at runtime. This should allow two machines with different > __delay() routines to happily co-exist within the same kernel > with minimal overhead. Say "This allows developers to implement their own __delay() ..." > Russell expressed concern that using a timer based __delay() > would cause problems where an iomapped device isn't mapped in > before a delay call was made (see > http://article.gmane.org/gmane.linux.ports.arm.kernel/78543 for > more info). We can sidestep that issue with this approach since > the __delay() routine _should_ only be pointed to a timer based > delay once the timer has been properly mapped. Up until that > point __delay() and udelay() will use delay_loop() which is > always safe to call. > > This patch is inspired by x86's delay.c > > Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> > Reviewed-by: Saravana Kannan <skannan@codeaurora.org> > --- > arch/arm/include/asm/delay.h | 2 ++ > arch/arm/lib/delay.c | 21 ++++++++++++++++++--- > 2 files changed, 20 insertions(+), 3 deletions(-) > > diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h > index ccc5ed5..7c732b5 100644 > --- a/arch/arm/include/asm/delay.h > +++ b/arch/arm/include/asm/delay.h > @@ -40,5 +40,7 @@ extern void __const_udelay(unsigned long); > __const_udelay((n) * ((2199023U*HZ)>>11))) : \ > __udelay(n)) > > +extern void set_delay_fn(void (*fn)(unsigned long)); > + > #endif /* defined(_ARM_DELAY_H) */ > > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c > index 5ee0adc..b307fcc 100644 > --- a/arch/arm/lib/delay.c > +++ b/arch/arm/lib/delay.c > @@ -3,6 +3,8 @@ > * > * Copyright (C) 1995, 1996 Russell King > * Copyright (c) 2010, Code Aurora Forum. All rights reserved. > + * Copyright (C) 1993 Linus Torvalds > + * Copyright (C) 1997 Martin Mares <mj@atrey.karlin.mff.cuni.cz> Who is Martin Mares? Why are you adding these? Daniel -- Sent by a consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/3] [ARM] Allow machines to override __delay() 2010-10-05 17:29 ` Daniel Walker @ 2010-10-06 3:36 ` Stephen Boyd 0 siblings, 0 replies; 22+ messages in thread From: Stephen Boyd @ 2010-10-06 3:36 UTC (permalink / raw) To: linux-arm-kernel On 10/05/2010 10:29 AM, Daniel Walker wrote: >> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c >> index 5ee0adc..b307fcc 100644 >> --- a/arch/arm/lib/delay.c >> +++ b/arch/arm/lib/delay.c >> @@ -3,6 +3,8 @@ >> * >> * Copyright (C) 1995, 1996 Russell King >> * Copyright (c) 2010, Code Aurora Forum. All rights reserved. >> + * Copyright (C) 1993 Linus Torvalds >> + * Copyright (C) 1997 Martin Mares <mj@atrey.karlin.mff.cuni.cz> > Who is Martin Mares? Why are you adding these? These are the respective copyrights from x86's delay.c for the code I copied. I'll see if they can be dropped. -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/3] [ARM] Implement a timer based __delay() loop 2010-09-28 3:33 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd 2010-09-28 3:33 ` [PATCH 1/3] [ARM] Translate delay.S into (mostly) C Stephen Boyd 2010-09-28 3:33 ` [PATCH 2/3] [ARM] Allow machines to override __delay() Stephen Boyd @ 2010-09-28 3:33 ` Stephen Boyd 2010-10-05 17:38 ` Daniel Walker 2 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-09-28 3:33 UTC (permalink / raw) To: linux-arm-kernel udelay() can be incorrect on SMP machines that scale their CPU frequencies independently of one another (as pointed out here http://article.gmane.org/gmane.linux.kernel/977567). The delay loop can either be too fast or too slow depending on which CPU the loops_per_jiffy counter is calibrated on and which CPU the delay loop is running on. udelay() can also be incorrect if the CPU frequency switches during the __delay() loop, causing the loop to either terminate too early, or too late. We'd rather not fix udelay() by forcing it to run on one CPU or take the penalty of a rather large loops_per_jiffy being used in udelay() when the CPU is actually running slower. Instead we solve the problem by making __delay() into a timer based loop which should be unaffected by CPU frequency scaling. Therefore, implement a timer based __delay() loop which can be used in place of the current register based __delay() if a machine so chooses. The kernel is already prepared for a timer based approach (evident by the read_current_timer() function). If an arch implements read_current_timer(), calibrate_delay() will use a timer based function, calibrate_delay_direct(), to calculate loops_per_jiffy (in which case loops_per_jiffy should really be renamed to timer_ticks_per_jiffy). Since the loops_per_jiffy will be based on timer ticks, __delay() should be implemented as a loop around read_current_timer(). Doing this makes the expensive loops_per_jiffy calculation go away (saving ~150ms on boot time on my machine) and fixes udelay() by making it safe in the face of independently scaling CPUs. The only prerequisite is that read_current_timer() is monotonically increasing across calls (and doesn't overflow within ~2000us). There is a downside to this approach though. BogoMIPS is no longer "accurate" in that it reflects the BogoMIPS of the timer and not the CPU. On most SoC's the timer isn't running anywhere near as fast as the CPU so BogoMIPS will be ridiculously low (my timer runs at 4.8 MHz and thus my BogoMIPS is 9.6 compared to my CPU's 800). This shouldn't be too much of a concern though since BogoMIPS are bogus anyway (hence the name). This loop is pretty much a copy of AVR's version made more generic. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Saravana Kannan <skannan@codeaurora.org> --- arch/arm/include/asm/delay.h | 1 + arch/arm/lib/delay.c | 18 ++++++++++++++++++ 2 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h index 7c732b5..5c6b9a3 100644 --- a/arch/arm/include/asm/delay.h +++ b/arch/arm/include/asm/delay.h @@ -41,6 +41,7 @@ extern void __const_udelay(unsigned long); __udelay(n)) extern void set_delay_fn(void (*fn)(unsigned long)); +extern void read_current_timer_delay_loop(unsigned long loops); #endif /* defined(_ARM_DELAY_H) */ diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c index b307fcc..59fdf64 100644 --- a/arch/arm/lib/delay.c +++ b/arch/arm/lib/delay.c @@ -5,6 +5,7 @@ * Copyright (c) 2010, Code Aurora Forum. All rights reserved. * Copyright (C) 1993 Linus Torvalds * Copyright (C) 1997 Martin Mares <mj@atrey.karlin.mff.cuni.cz> + * Copyright (C) 2005-2006 Atmel Corporation * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -12,6 +13,7 @@ */ #include <linux/module.h> #include <linux/delay.h> +#include <linux/timex.h> /* * Oh, if only we had a cycle counter... @@ -26,6 +28,22 @@ void delay_loop(unsigned long loops) ); } +#ifdef ARCH_HAS_READ_CURRENT_TIMER +/* + * Assuming read_current_timer() is monotonically increasing + * across calls. + */ +void read_current_timer_delay_loop(unsigned long loops) +{ + unsigned long bclock, now; + + read_current_timer(&bclock); + do { + read_current_timer(&now); + } while ((now - bclock) < loops); +} +#endif + static void (*delay_fn)(unsigned long) = delay_loop; void set_delay_fn(void (*fn)(unsigned long)) -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/3] [ARM] Implement a timer based __delay() loop 2010-09-28 3:33 ` [PATCH 3/3] [ARM] Implement a timer based __delay() loop Stephen Boyd @ 2010-10-05 17:38 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 0 siblings, 1 reply; 22+ messages in thread From: Daniel Walker @ 2010-10-05 17:38 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2010-09-27 at 20:33 -0700, Stephen Boyd wrote: > udelay() can be incorrect on SMP machines that scale their CPU > frequencies independently of one another (as pointed out here > http://article.gmane.org/gmane.linux.kernel/977567). The delay > loop can either be too fast or too slow depending on which CPU the > loops_per_jiffy counter is calibrated on and which CPU the delay > loop is running on. udelay() can also be incorrect if the > CPU frequency switches during the __delay() loop, causing the loop > to either terminate too early, or too late. > > We'd rather not fix udelay() by forcing it to run on one CPU or > take the penalty of a rather large loops_per_jiffy being used in > udelay() when the CPU is actually running slower. Instead we > solve the problem by making __delay() into a timer based loop > which should be unaffected by CPU frequency scaling. Therefore, > implement a timer based __delay() loop which can be used in place > of the current register based __delay() if a machine so chooses. > > The kernel is already prepared for a timer based approach > (evident by the read_current_timer() function). If an arch > implements read_current_timer(), calibrate_delay() will use a > timer based function, calibrate_delay_direct(), to calculate > loops_per_jiffy (in which case loops_per_jiffy should really be > renamed to timer_ticks_per_jiffy). Since the loops_per_jiffy will > be based on timer ticks, __delay() should be implemented as a > loop around read_current_timer(). > > Doing this makes the expensive loops_per_jiffy calculation go > away (saving ~150ms on boot time on my machine) and fixes > udelay() by making it safe in the face of independently scaling > CPUs. The only prerequisite is that read_current_timer() is > monotonically increasing across calls (and doesn't overflow > within ~2000us). > > There is a downside to this approach though. BogoMIPS is no > longer "accurate" in that it reflects the BogoMIPS of the timer > and not the CPU. On most SoC's the timer isn't running anywhere > near as fast as the CPU so BogoMIPS will be ridiculously low (my > timer runs at 4.8 MHz and thus my BogoMIPS is 9.6 compared to my > CPU's 800). This shouldn't be too much of a concern though since > BogoMIPS are bogus anyway (hence the name). > > This loop is pretty much a copy of AVR's version made more generic. > > Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> > Reviewed-by: Saravana Kannan <skannan@codeaurora.org> > --- > arch/arm/include/asm/delay.h | 1 + > arch/arm/lib/delay.c | 18 ++++++++++++++++++ > 2 files changed, 19 insertions(+), 0 deletions(-) > > diff --git a/arch/arm/include/asm/delay.h b/arch/arm/include/asm/delay.h > index 7c732b5..5c6b9a3 100644 > --- a/arch/arm/include/asm/delay.h > +++ b/arch/arm/include/asm/delay.h > @@ -41,6 +41,7 @@ extern void __const_udelay(unsigned long); > __udelay(n)) > > extern void set_delay_fn(void (*fn)(unsigned long)); > +extern void read_current_timer_delay_loop(unsigned long loops); > > #endif /* defined(_ARM_DELAY_H) */ > > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c > index b307fcc..59fdf64 100644 > --- a/arch/arm/lib/delay.c > +++ b/arch/arm/lib/delay.c > @@ -5,6 +5,7 @@ > * Copyright (c) 2010, Code Aurora Forum. All rights reserved. > * Copyright (C) 1993 Linus Torvalds > * Copyright (C) 1997 Martin Mares <mj@atrey.karlin.mff.cuni.cz> > + * Copyright (C) 2005-2006 Atmel Corporation > * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License version 2 as > @@ -12,6 +13,7 @@ > */ > #include <linux/module.h> > #include <linux/delay.h> > +#include <linux/timex.h> > > /* > * Oh, if only we had a cycle counter... > @@ -26,6 +28,22 @@ void delay_loop(unsigned long loops) > ); > } > > +#ifdef ARCH_HAS_READ_CURRENT_TIMER > +/* > + * Assuming read_current_timer() is monotonically increasing > + * across calls. You should add more comments here. You assuming that it's monotonic over a 2000us (2ms) period .. I'm not sure this is a good assumption this timer may not be monotonically increasing all the time, what happens then? > +void read_current_timer_delay_loop(unsigned long loops) > +{ > + unsigned long bclock, now; > + > + read_current_timer(&bclock); > + do { > + read_current_timer(&now); > + } while ((now - bclock) < loops); Have you looked@time_before()/time_after() ? Daniel -- Sent by a consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/3] [ARM] Implement a timer based __delay() loop 2010-10-05 17:38 ` Daniel Walker @ 2010-10-06 3:36 ` Stephen Boyd 2010-10-06 13:44 ` Daniel Walker 0 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-10-06 3:36 UTC (permalink / raw) To: linux-arm-kernel On 10/05/2010 10:38 AM, Daniel Walker wrote: >> >> +#ifdef ARCH_HAS_READ_CURRENT_TIMER >> +/* >> + * Assuming read_current_timer() is monotonically increasing >> + * across calls. > You should add more comments here. You assuming that it's monotonic over > a 2000us (2ms) period .. I'm not sure this is a good assumption this > timer may not be monotonically increasing all the time, what happens > then? Ok I'll add that it shouldn't wrap more than once within 2000us (or should I say 5ms since mdelay uses udelay?). Is that what you're saying by it not being monotonically increasing? If a timer isn't increasing the tick count it's broken and this call will loop forever. If the timer wraps, we'll be safe due to unsigned maths as long as it wraps only once. >> +void read_current_timer_delay_loop(unsigned long loops) >> +{ >> + unsigned long bclock, now; >> + >> + read_current_timer(&bclock); >> + do { >> + read_current_timer(&now); >> + } while ((now - bclock) < loops); > Have you looked at time_before()/time_after() ? Nope. Wouldn't that require an addition though to make it work? I'd rather just leave it like it is. -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/3] [ARM] Implement a timer based __delay() loop 2010-10-06 3:36 ` Stephen Boyd @ 2010-10-06 13:44 ` Daniel Walker 0 siblings, 0 replies; 22+ messages in thread From: Daniel Walker @ 2010-10-06 13:44 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2010-10-05 at 20:36 -0700, Stephen Boyd wrote: > On 10/05/2010 10:38 AM, Daniel Walker wrote: > >> > >> +#ifdef ARCH_HAS_READ_CURRENT_TIMER > >> +/* > >> + * Assuming read_current_timer() is monotonically increasing > >> + * across calls. > > You should add more comments here. You assuming that it's monotonic over > > a 2000us (2ms) period .. I'm not sure this is a good assumption this > > timer may not be monotonically increasing all the time, what happens > > then? > > Ok I'll add that it shouldn't wrap more than once within 2000us (or > should I say 5ms since mdelay uses udelay?). Is that what you're saying > by it not being monotonically increasing? If a timer isn't increasing > the tick count it's broken and this call will loop forever. If the timer > wraps, we'll be safe due to unsigned maths as long as it wraps only once. I was saying it could be near wrapping when this process starts, then wrap in the middle. The unsigned math thing should work I think, but add a comment .. Your basically saying it's OK if it wraps, but the comment says it needs to be "monotonically increasing" which actually doesn't appear to be true. It needs to not wrap twice during this process. > >> +void read_current_timer_delay_loop(unsigned long loops) > >> +{ > >> + unsigned long bclock, now; > >> + > >> + read_current_timer(&bclock); > >> + do { > >> + read_current_timer(&now); > >> + } while ((now - bclock) < loops); > > Have you looked at time_before()/time_after() ? > > Nope. Wouldn't that require an addition though to make it work? I'd > rather just leave it like it is. I'm not saying you should change it, I'm saying look at those macro's to make sure your doing thinks correctly. Daniel -- Sent by an consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) @ 2010-09-07 18:23 Stephen Boyd 2010-09-22 21:54 ` Daniel Walker 0 siblings, 1 reply; 22+ messages in thread From: Stephen Boyd @ 2010-09-07 18:23 UTC (permalink / raw) To: linux-arm-kernel (Sorry, resending due to stray comma...) These patches are another attempt at fixing the udelay() issue pointed out on arm-lkml[1][2]. A quick recap: some SMP machines can scale their CPU frequencies independent of one another. loops_per_jiffy is calibrated globally and used in __const_udelay(). If one CPU is running faster than what the loops_per_jiffy is calculated (or scaled) for, udelay() will be incorrect and not wait long enough (or too long). A similar problem occurs if the cpu frequency is scaled during a udelay() call. We could fix this issue a couple ways, wholesale replacement of __udelay() and __const_udelay() (see [2] for that approach), or replacement of __delay() (this series). Option 1 can fail if anybody uses udelay() before memory is mapped and also duplicates most of the code in asm/delay.h. It also needs to hardcode the timer tick frequency, which can sometimes be inaccurate. The benefit is that loops_per_jiffy stays the same and thus BogoMIPS is unchanged. Option 2 can't fail since the __delay() loop is replaced after memory is mapped in, but it suffers from a low BogoMIPS when timers are clocked slowly. It also more accurately calculates the timer tick frequency through the use of calibrate_delay_direct(). -- Reference -- [1] http://article.gmane.org/gmane.linux.kernel/977567 [2] http://article.gmane.org/gmane.linux.ports.arm.kernel/78496 Stephen Boyd (3): [ARM] Translate delay.S into (mostly) C [ARM] Allow machines to override __delay() [ARM] Implement a timer based __delay() loop arch/arm/include/asm/delay.h | 5 ++- arch/arm/kernel/armksyms.c | 4 -- arch/arm/lib/delay.S | 65 ------------------------------ arch/arm/lib/delay.c | 90 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 94 insertions(+), 70 deletions(-) delete mode 100644 arch/arm/lib/delay.S create mode 100644 arch/arm/lib/delay.c -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) 2010-09-07 18:23 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd @ 2010-09-22 21:54 ` Daniel Walker 0 siblings, 0 replies; 22+ messages in thread From: Daniel Walker @ 2010-09-22 21:54 UTC (permalink / raw) To: linux-arm-kernel Russell Is this series something you would be willing to pull into your tree? Daniel On Tue, 2010-09-07 at 11:23 -0700, Stephen Boyd wrote: > (Sorry, resending due to stray comma...) > > These patches are another attempt at fixing the udelay() > issue pointed out on arm-lkml[1][2]. A quick recap: some SMP > machines can scale their CPU frequencies independent of one > another. loops_per_jiffy is calibrated globally and used in > __const_udelay(). If one CPU is running faster than what the > loops_per_jiffy is calculated (or scaled) for, udelay() will > be incorrect and not wait long enough (or too long). A similar > problem occurs if the cpu frequency is scaled during a udelay() > call. > > We could fix this issue a couple ways, wholesale replacement > of __udelay() and __const_udelay() (see [2] for that approach), > or replacement of __delay() (this series). Option 1 can fail if > anybody uses udelay() before memory is mapped and also duplicates > most of the code in asm/delay.h. It also needs to hardcode the > timer tick frequency, which can sometimes be inaccurate. The > benefit is that loops_per_jiffy stays the same and thus BogoMIPS > is unchanged. Option 2 can't fail since the __delay() loop is > replaced after memory is mapped in, but it suffers from a low > BogoMIPS when timers are clocked slowly. It also more accurately > calculates the timer tick frequency through the use of > calibrate_delay_direct(). > > -- Reference -- > [1] http://article.gmane.org/gmane.linux.kernel/977567 > [2] http://article.gmane.org/gmane.linux.ports.arm.kernel/78496 > > Stephen Boyd (3): > [ARM] Translate delay.S into (mostly) C > [ARM] Allow machines to override __delay() > [ARM] Implement a timer based __delay() loop > > arch/arm/include/asm/delay.h | 5 ++- > arch/arm/kernel/armksyms.c | 4 -- > arch/arm/lib/delay.S | 65 ------------------------------ > arch/arm/lib/delay.c | 90 ++++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 94 insertions(+), 70 deletions(-) > delete mode 100644 arch/arm/lib/delay.S > create mode 100644 arch/arm/lib/delay.c > -- Sent by a consultant of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) @ 2010-08-19 2:24 Stephen Boyd 0 siblings, 0 replies; 22+ messages in thread From: Stephen Boyd @ 2010-08-19 2:24 UTC (permalink / raw) To: linux-arm-kernel These patches are another attempt at fixing the udelay() issue pointed out on arm-lkml[1][2]. A quick recap: some SMP machines can scale their CPU frequencies independent of one another. loops_per_jiffy is calibrated globally and used in __const_udelay(). If one CPU is running faster than what the loops_per_jiffy is calculated (or scaled) for, udelay() will be incorrect and not wait long enough (or too long). A similar problem occurs if the cpu frequency is scaled during a udelay() call. We could fix this issue a couple ways, wholesale replacement of __udelay() and __const_udelay() (see [2] for that approach), or replacement of __delay() (this series). Option 1 can fail if anybody uses udelay() before memory is mapped and also duplicates most of the code in asm/delay.h. It also needs to hardcode the timer tick frequency, which can sometimes be inaccurate. The benefit is that loops_per_jiffy stays the same and thus BogoMIPS is unchanged. Option 2 can't fail since the __delay() loop is replaced after memory is mapped in, but it suffers from a low BogoMIPS when timers are clocked slowly. It also more accurately calculates the timer tick frequency through the use of calibrate_delay_direct(). -- Reference -- [1] http://article.gmane.org/gmane.linux.kernel/977567 [2] http://article.gmane.org/gmane.linux.ports.arm.kernel/78496 Stephen Boyd (3): [ARM] Translate delay.S into (mostly) C [ARM] Allow machines to override __delay() [ARM] Implement a timer based __delay() loop arch/arm/include/asm/delay.h | 5 ++- arch/arm/kernel/armksyms.c | 4 -- arch/arm/lib/delay.S | 65 ------------------------------ arch/arm/lib/delay.c | 90 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 94 insertions(+), 70 deletions(-) delete mode 100644 arch/arm/lib/delay.S create mode 100644 arch/arm/lib/delay.c -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2010-10-08 1:12 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-09-28 3:33 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd 2010-09-28 3:33 ` [PATCH 1/3] [ARM] Translate delay.S into (mostly) C Stephen Boyd 2010-10-05 17:22 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 2010-10-06 13:38 ` Daniel Walker 2010-10-06 14:26 ` Nicolas Pitre 2010-10-06 18:30 ` Stephen Boyd 2010-10-06 19:35 ` Daniel Walker 2010-10-06 20:05 ` Nicolas Pitre 2010-10-08 0:11 ` Stephen Boyd 2010-10-08 1:12 ` Nicolas Pitre 2010-10-06 20:24 ` Stephen Boyd 2010-09-28 3:33 ` [PATCH 2/3] [ARM] Allow machines to override __delay() Stephen Boyd 2010-10-05 17:29 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 2010-09-28 3:33 ` [PATCH 3/3] [ARM] Implement a timer based __delay() loop Stephen Boyd 2010-10-05 17:38 ` Daniel Walker 2010-10-06 3:36 ` Stephen Boyd 2010-10-06 13:44 ` Daniel Walker -- strict thread matches above, loose matches on Subject: below -- 2010-09-07 18:23 [PATCH 0/3] Fixing udelay() on SMP (and non-SMP too) Stephen Boyd 2010-09-22 21:54 ` Daniel Walker 2010-08-19 2:24 Stephen Boyd
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).