* [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable @ 2005-07-27 14:13 Steven Rostedt 2005-07-27 14:17 ` Ingo Molnar 2005-07-28 1:00 ` Daniel Walker 0 siblings, 2 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 14:13 UTC (permalink / raw) To: LKML; +Cc: Linus Torvalds, Andrew Morton, Ingo Molnar The following patch makes the MAX_RT_PRIO and MAX_USER_RT_PRIO configurable from the make *config. This is more of a proposal since I'm not really sure where in Kconfig this would best fit. I don't see why these options shouldn't be user configurable without going into the kernel headers to change them. Also, is there a way in the Kconfig to force the checking of MAX_USER_RT_PRIO <= MAX_RT_PRIO? -- Steve (Patched against 2.6.12.2) Index: vanilla_kernel/include/linux/sched.h =================================================================== --- vanilla_kernel/include/linux/sched.h (revision 263) +++ vanilla_kernel/include/linux/sched.h (working copy) @@ -389,9 +389,13 @@ * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO. */ -#define MAX_USER_RT_PRIO 100 -#define MAX_RT_PRIO MAX_USER_RT_PRIO +#define MAX_USER_RT_PRIO CONFIG_MAX_USER_RT_PRIO +#define MAX_RT_PRIO CONFIG_MAX_RT_PRIO +#if MAX_USER_RT_PRIO > MAX_RT_PRIO +#error MAX_USER_RT_PRIO must not be greater than MAX_RT_PRIO +#endif + #define MAX_PRIO (MAX_RT_PRIO + 40) #define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO)) Index: vanilla_kernel/init/Kconfig =================================================================== --- vanilla_kernel/init/Kconfig (revision 263) +++ vanilla_kernel/init/Kconfig (working copy) @@ -162,6 +162,32 @@ building a kernel for install/rescue disks or your system is very limited in memory. +config MAX_RT_PRIO + int "Maximum RT priority" + default 100 + help + The real-time priority of threads that have the policy of SCHED_FIFO + or SCHED_RR have a priority higher than normal threads. This range + can be set here, where the range starts from 0 to MAX_RT_PRIO-1. + If this range is higher than MAX_USER_RT_PRIO than kernel threads + may have a higher priority than any user thread. + + This may be the same as MAX_USER_RT_PRIO, but do not set this + to be less than MAX_USER_RT_PRIO. + +config MAX_USER_RT_PRIO + int "Maximum User RT priority" + default 100 + help + The real-time priority of threads that have the policy of SCHED_FIFO + or SCHED_RR have a priority higher than normal threads. This range + can be set here, where the range starts from 0 to MAX_USER_RT_PRIO-1. + If this range is lower than MAX_RT_PRIO, than kernel threads may have + a higher priority than any user thread. + + This may be the same as MAX_RT_PRIO, but do not set this to be + greater than MAX_RT_PRIO. + config AUDIT bool "Auditing support" default y if SECURITY_SELINUX ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:13 [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Steven Rostedt @ 2005-07-27 14:17 ` Ingo Molnar 2005-07-27 14:26 ` Steven Rostedt ` (2 more replies) 2005-07-28 1:00 ` Daniel Walker 1 sibling, 3 replies; 50+ messages in thread From: Ingo Molnar @ 2005-07-27 14:17 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > The following patch makes the MAX_RT_PRIO and MAX_USER_RT_PRIO > configurable from the make *config. This is more of a proposal since > I'm not really sure where in Kconfig this would best fit. I don't see > why these options shouldn't be user configurable without going into > the kernel headers to change them. i'd not do this patch, mainly because the '100 priority levels' thing is pretty much an assumption in lots of userspace code. The patch to make it easier to redefine it is of course fine and was accepted, but i dont think we want to make it explicit via .config. It's a bit like with the 3:1 split: you can redefine it easily via include files, but it's not configurable via .config, because many people would just play with it and would see things break. so unless there's really a desire from distributions to actually change the 100 RT-prio levels (and i dont sense such a desire), we shouldnt do this. Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:17 ` Ingo Molnar @ 2005-07-27 14:26 ` Steven Rostedt 2005-07-27 14:33 ` Ingo Molnar 2005-07-27 14:28 ` Steven Rostedt 2005-07-28 1:42 ` Matt Mackall 2 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 14:26 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 16:17 +0200, Ingo Molnar wrote: > i'd not do this patch, mainly because the '100 priority levels' thing is > pretty much an assumption in lots of userspace code. The patch to make > it easier to redefine it is of course fine and was accepted, but i dont > think we want to make it explicit via .config. > > It's a bit like with the 3:1 split: you can redefine it easily via > include files, but it's not configurable via .config, because many > people would just play with it and would see things break. > > so unless there's really a desire from distributions to actually change > the 100 RT-prio levels (and i dont sense such a desire), we shouldnt do > this. Perfectly understood. I've had two customers ask me to increase the priorities for them, but those where custom kernels, and a config option wasn't necessary. But since I've had customers asking, I thought that this might be something that others want. But I deal with a niche market, and what my customers want might not be what everyone wants. (hence the RFC in the subject). So if there are others out there that would prefer to change their priority ranges, speak now otherwise this patch will go by the waste side. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:26 ` Steven Rostedt @ 2005-07-27 14:33 ` Ingo Molnar 2005-07-27 14:47 ` [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO Steven Rostedt 2005-07-27 14:53 ` [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Esben Nielsen 0 siblings, 2 replies; 50+ messages in thread From: Ingo Molnar @ 2005-07-27 14:33 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > Perfectly understood. I've had two customers ask me to increase the > priorities for them, but those where custom kernels, and a config > option wasn't necessary. But since I've had customers asking, I > thought that this might be something that others want. But I deal > with a niche market, and what my customers want might not be what > everyone wants. (hence the RFC in the subject). > > So if there are others out there that would prefer to change their > priority ranges, speak now otherwise this patch will go by the waste > side. i'm not excluding that this will become necessary in the future. We should also add the safety check to sched.h - all i'm suggesting is to not make it a .config option just now, because that tends to be fiddled with. Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO 2005-07-27 14:33 ` Ingo Molnar @ 2005-07-27 14:47 ` Steven Rostedt 2005-07-27 15:05 ` Steven Rostedt 2005-07-27 18:52 ` Ingo Molnar 2005-07-27 14:53 ` [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Esben Nielsen 1 sibling, 2 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 14:47 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 16:33 +0200, Ingo Molnar wrote: > i'm not excluding that this will become necessary in the future. We > should also add the safety check to sched.h - all i'm suggesting is to > not make it a .config option just now, because that tends to be fiddled > with. So Ingo, would you agree that at least this patch should go in? -- Steve (Patched against 2.6.12.2) Index: vanilla_kernel/include/linux/sched.h =================================================================== --- vanilla_kernel/include/linux/sched.h (revision 263) +++ vanilla_kernel/include/linux/sched.h (working copy) @@ -392,6 +392,10 @@ #define MAX_USER_RT_PRIO 100 #define MAX_RT_PRIO MAX_USER_RT_PRIO +#if MAX_USER_RT_PRIO > MAX_RT_PRIO +#error MAX_USER_RT_PRIO must not be greater than MAX_RT_PRIO +#endif + #define MAX_PRIO (MAX_RT_PRIO + 40) #define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO)) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO 2005-07-27 14:47 ` [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO Steven Rostedt @ 2005-07-27 15:05 ` Steven Rostedt 2005-07-27 18:52 ` Ingo Molnar 1 sibling, 0 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 15:05 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, Linus Torvalds, LKML The subject should be "[PATCH] safety check of MAX_RT_PRIO >= MAX_USER_RT_PRIO". Ever since I've upgraded my debian unstable, I've lost my spell checking! -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO 2005-07-27 14:47 ` [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO Steven Rostedt 2005-07-27 15:05 ` Steven Rostedt @ 2005-07-27 18:52 ` Ingo Molnar 1 sibling, 0 replies; 50+ messages in thread From: Ingo Molnar @ 2005-07-27 18:52 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 2005-07-27 at 16:33 +0200, Ingo Molnar wrote: > > i'm not excluding that this will become necessary in the future. We > > should also add the safety check to sched.h - all i'm suggesting is to > > not make it a .config option just now, because that tends to be fiddled > > with. > > So Ingo, would you agree that at least this patch should go in? yeah. Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:33 ` Ingo Molnar 2005-07-27 14:47 ` [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO Steven Rostedt @ 2005-07-27 14:53 ` Esben Nielsen 2005-07-27 15:02 ` Steven Rostedt 2005-07-27 16:09 ` K.R. Foley 1 sibling, 2 replies; 50+ messages in thread From: Esben Nielsen @ 2005-07-27 14:53 UTC (permalink / raw) To: Ingo Molnar; +Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton On Wed, 27 Jul 2005, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > Perfectly understood. I've had two customers ask me to increase the > > priorities for them, but those where custom kernels, and a config > > option wasn't necessary. But since I've had customers asking, I > > thought that this might be something that others want. But I deal > > with a niche market, and what my customers want might not be what > > everyone wants. (hence the RFC in the subject). > > > > So if there are others out there that would prefer to change their > > priority ranges, speak now otherwise this patch will go by the waste > > side. > > i'm not excluding that this will become necessary in the future. We > should also add the safety check to sched.h - all i'm suggesting is to > not make it a .config option just now, because that tends to be fiddled > with. > Isn't there a way to mark it "warning! warning! dangerous!" ? Anyway: I think 100 RT priorities is way overkill - and slowing things down by making the scheduler checking more empty slots in the runqueue. Default ought to be 10. In practise it will be very hard to have a task at the lower RT priority behaving real-time with 99 higher priority tasks around. I find it hard to believe that somebody has an RT app needing more than 10 priorities and can't do with RR or FIFO scheduling within a fewer number of prorities. Esben > Ingo > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:53 ` [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Esben Nielsen @ 2005-07-27 15:02 ` Steven Rostedt 2005-07-27 16:09 ` K.R. Foley 1 sibling, 0 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 15:02 UTC (permalink / raw) To: Esben Nielsen; +Cc: Ingo Molnar, LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 16:53 +0200, Esben Nielsen wrote: > Isn't there a way to mark it "warning! warning! dangerous!" ? Sure, config MAX_RT_PRIO int "Maximum RT priority (DANGEROUS!)" > > Anyway: I think 100 RT priorities is way overkill - and slowing things > down by making the scheduler checking more empty slots in the runqueue. > Default ought to be 10. In practise it will be very hard to have > a task at the lower RT priority behaving real-time with 99 higher > priority tasks around. I find it hard to believe that somebody has an RT > app needing more than 10 priorities and can't do with RR or FIFO > scheduling within a fewer number of prorities. I've wondered this too, especially when customers ask for more. I never question them (don't bite the hand that feeds you), but I really think that they like to prioritize things by 10. So you go from 10 20 30 instead, leaving room in between incase you forget something. Kind of like the old BASIC programming days with line numbers. There's also the case that you have groups of tasks that are all higher in priority than other groups. But the tasks themselves have priorities with each other. So you group the tasks in a range where they are greater than or less then other task groups. And then you can arrange the tasks yourself. But this all gets quite complex and then leads to more custom kernels :-) -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:53 ` [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Esben Nielsen 2005-07-27 15:02 ` Steven Rostedt @ 2005-07-27 16:09 ` K.R. Foley 2005-07-27 17:01 ` Esben Nielsen 1 sibling, 1 reply; 50+ messages in thread From: K.R. Foley @ 2005-07-27 16:09 UTC (permalink / raw) To: Esben Nielsen Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton Esben Nielsen wrote: > On Wed, 27 Jul 2005, Ingo Molnar wrote: > > >>* Steven Rostedt <rostedt@goodmis.org> wrote: >> >> >>>Perfectly understood. I've had two customers ask me to increase the >>>priorities for them, but those where custom kernels, and a config >>>option wasn't necessary. But since I've had customers asking, I >>>thought that this might be something that others want. But I deal >>>with a niche market, and what my customers want might not be what >>>everyone wants. (hence the RFC in the subject). >>> >>>So if there are others out there that would prefer to change their >>>priority ranges, speak now otherwise this patch will go by the waste >>>side. >> >>i'm not excluding that this will become necessary in the future. We >>should also add the safety check to sched.h - all i'm suggesting is to >>not make it a .config option just now, because that tends to be fiddled >>with. >> > > Isn't there a way to mark it "warning! warning! dangerous!" ? > > Anyway: I think 100 RT priorities is way overkill - and slowing things > down by making the scheduler checking more empty slots in the runqueue. > Default ought to be 10. In practise it will be very hard to have > a task at the lower RT priority behaving real-time with 99 higher > priority tasks around. I find it hard to believe that somebody has an RT > app needing more than 10 priorities and can't do with RR or FIFO > scheduling within a fewer number of prorities. > > Esben > Actually, is it really that slow to search a bitmap for a slot that needs processing? I work on real-time test stands which are less of an embedded system and more of a real Unix system that require determinism. It is very nice in some cases to have more than 10 RT priorities to work with. -- kr ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 16:09 ` K.R. Foley @ 2005-07-27 17:01 ` Esben Nielsen 2005-07-27 17:25 ` Steven Rostedt 2005-07-27 17:42 ` K.R. Foley 0 siblings, 2 replies; 50+ messages in thread From: Esben Nielsen @ 2005-07-27 17:01 UTC (permalink / raw) To: K.R. Foley Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton On Wed, 27 Jul 2005, K.R. Foley wrote: > Esben Nielsen wrote: > > On Wed, 27 Jul 2005, Ingo Molnar wrote: > > > > > >>* Steven Rostedt <rostedt@goodmis.org> wrote: > >> > >> > >>>Perfectly understood. I've had two customers ask me to increase the > >>>priorities for them, but those where custom kernels, and a config > >>>option wasn't necessary. But since I've had customers asking, I > >>>thought that this might be something that others want. But I deal > >>>with a niche market, and what my customers want might not be what > >>>everyone wants. (hence the RFC in the subject). > >>> > >>>So if there are others out there that would prefer to change their > >>>priority ranges, speak now otherwise this patch will go by the waste > >>>side. > >> > >>i'm not excluding that this will become necessary in the future. We > >>should also add the safety check to sched.h - all i'm suggesting is to > >>not make it a .config option just now, because that tends to be fiddled > >>with. > >> > > > > Isn't there a way to mark it "warning! warning! dangerous!" ? > > > > Anyway: I think 100 RT priorities is way overkill - and slowing things > > down by making the scheduler checking more empty slots in the runqueue. > > Default ought to be 10. In practise it will be very hard to have > > a task at the lower RT priority behaving real-time with 99 higher > > priority tasks around. I find it hard to believe that somebody has an RT > > app needing more than 10 priorities and can't do with RR or FIFO > > scheduling within a fewer number of prorities. > > > > Esben > > > > Actually, is it really that slow to search a bitmap for a slot that > needs processing? No, it is ultra fast - but done very often. > I work on real-time test stands which are less of an > embedded system and more of a real Unix system that require determinism. > It is very nice in some cases to have more than 10 RT priorities to work > with. What for? Why can't you use FIFO at the same priorities for some of your tasks? I pretty much quess you have a very few tasks which have some high requirements. The rest of you "RT" task could easily share the lowest RT priority. FIFO would also be more effective as you will have context switches. This about multiple priorities probably comes from an ordering of tasks: You have a lot of task. You have a feeling about which one ought to be more important than the other. Thus you end of with an ordered list of tasks. BUT when you boil it down to what RT is all about, namely meeting your deadlines, it doesn't matter after the 5-10 priorities because the 5-10 priorities have introduced a lot of jitter to the rest of the tasks anyway. You can just as well just put them at the same priority. Esben > > -- > kr > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 17:01 ` Esben Nielsen @ 2005-07-27 17:25 ` Steven Rostedt 2005-07-27 21:32 ` Esben Nielsen 2005-07-28 7:22 ` Ingo Molnar 2005-07-27 17:42 ` K.R. Foley 1 sibling, 2 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 17:25 UTC (permalink / raw) To: Esben Nielsen Cc: K.R. Foley, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 19:01 +0200, Esben Nielsen wrote: > > What for? Why can't you use FIFO at the same priorities for some of your > tasks? I pretty much quess you have a very few tasks which have some high > requirements. The rest of you "RT" task could easily share the lowest RT > priority. FIFO would also be more effective as you will have context > switches. > > This about multiple priorities probably comes from an ordering of tasks: > You have a lot of task. You have a feeling about which one ought to be > more important than the other. Thus you end of with an ordered list of > tasks. BUT when you boil it down to what RT is all about, namely > meeting your deadlines, it doesn't matter after the 5-10 priorities > because the 5-10 priorities have introduced a lot of jitter to the rest > of the tasks anyway. You can just as well just put them at the same > priority. Nope, I wouldn't agree with you here. If you have tasks that will run periodically, at different frequencies, you need to order them. And each task would probably need a different priority. FIFO is very dangerous since it doesn't release a task until that task voluntarily sleeps. A colleague of mine, well actually the VP of my company of the time, Doug Locke, gave me a perfect example. If you have a program that runs a nuclear power plant that needs to wake up and run 4 seconds every 10 seconds, and on that same computer you have a program running a washing machine that needs to wake up every 3 seconds and run for one second (I'm using seconds just to make the example simple). Which process gets the higher priority? The answer is the washing machine. Rational: If the power plant was higher priority, the washing machine would fail almost every time, since the power plant program would run for 4 seconds, and since the cycle of the washing machine is 3 seconds, it would fail everytime the nuclear power plant program ran. Now if you have the washing machine run in it's cycle, the nuclear power plant can easily make the 4 seconds ever 10 seconds, even when it is interrupted by the washing machine. Doug also mentioned that you really want to have every task with a different priority, so it makes sense to have a lot of priorities. I can't remember why he said this, but I'm sure you and I can find out by searching through his papers. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 17:25 ` Steven Rostedt @ 2005-07-27 21:32 ` Esben Nielsen 2005-07-28 12:17 ` Steven Rostedt 2005-07-28 7:22 ` Ingo Molnar 1 sibling, 1 reply; 50+ messages in thread From: Esben Nielsen @ 2005-07-27 21:32 UTC (permalink / raw) To: Steven Rostedt Cc: K.R. Foley, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton On Wed, 27 Jul 2005, Steven Rostedt wrote: > On Wed, 2005-07-27 at 19:01 +0200, Esben Nielsen wrote: > > > > What for? Why can't you use FIFO at the same priorities for some of your > > tasks? I pretty much quess you have a very few tasks which have some high > > requirements. The rest of you "RT" task could easily share the lowest RT > > priority. FIFO would also be more effective as you will have context > > switches. > > > > This about multiple priorities probably comes from an ordering of tasks: > > You have a lot of task. You have a feeling about which one ought to be > > more important than the other. Thus you end of with an ordered list of > > tasks. BUT when you boil it down to what RT is all about, namely > > meeting your deadlines, it doesn't matter after the 5-10 priorities > > because the 5-10 priorities have introduced a lot of jitter to the rest > > of the tasks anyway. You can just as well just put them at the same > > priority. > > Nope, I wouldn't agree with you here. If you have tasks that will run > periodically, at different frequencies, you need to order them. And each > task would probably need a different priority. FIFO is very dangerous > since it doesn't release a task until that task voluntarily sleeps. > > A colleague of mine, well actually the VP of my company of the time, > Doug Locke, gave me a perfect example. If you have a program that runs > a nuclear power plant that needs to wake up and run 4 seconds every 10 > seconds, and on that same computer you have a program running a washing > machine that needs to wake up every 3 seconds and run for one second > (I'm using seconds just to make the example simple). Which process gets > the higher priority? The answer is the washing machine. > > Rational: If the power plant was higher priority, the washing machine > would fail almost every time, since the power plant program would run > for 4 seconds, and since the cycle of the washing machine is 3 seconds, > it would fail everytime the nuclear power plant program ran. Now if you > have the washing machine run in it's cycle, the nuclear power plant can > easily make the 4 seconds ever 10 seconds, even when it is interrupted > by the washing machine. > This is rate monotonic scheduling (http://en.wikipedia.org/wiki/Rate-monotonic_scheduling). ((Notice: The article gets it wrong on the priority inheritance/ceiling stuff...)) (Notice: The fewer tasks the higher the theoretical max to the CPU utialization :-) In theory it works fine, but there is some drawbacks when you go to reality: 1) You have to know the running time of all your tasks. 2) The running time is assumed contant. 3) You don't take into account the time it takes to perform a task switch. 4) Mutual exclution isn't considered. 5) What if things aren't periodic? To make this work according to the theory you have to make a really, really detailled analysis of your system. It is not really posible for more than a few tasks. In practise you will have to be very much below the theoretical limit for CPU utialization to be safe. And my claim is: All the different periodic jobs you want to perform can most often easily be put into groups like 1ms, 5ms, 20ms, 100ms groups. In each group you can easily run with fifo preemption - and even within one OS thread. Think about it: If you have two tasks with close to the same period and one runs too long for the other to meets it's deadlines within fifo setup, you are bound to be very close to the limits no matter which one you give the highest priority. > Doug also mentioned that you really want to have every task with a > different priority, so it makes sense to have a lot of priorities. I > can't remember why he said this, but I'm sure you and I can find out by > searching through his papers. > On the contrary. Especially if you have mutual exclustion. If you run two tasks at the same priority with fifo you don't hit congestion on you mutexes (between those two tasks at least). If you give one of them just one higher priority it will preempt the other. You thus risk congestion - which is an expensive thing. Thus by giving your tasks different priorities you risk that you system can't meet the deadlines due to the extra CPU used!! > -- Steve > > > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 21:32 ` Esben Nielsen @ 2005-07-28 12:17 ` Steven Rostedt 0 siblings, 0 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 12:17 UTC (permalink / raw) To: Esben Nielsen Cc: K.R. Foley, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 23:32 +0200, Esben Nielsen wrote: > > This is rate monotonic scheduling > (http://en.wikipedia.org/wiki/Rate-monotonic_scheduling). Bingo! you win the cigar! > ((Notice: The article gets it wrong on the priority inheritance/ceiling > stuff...)) It's too early in the morning for me. What's wrong with this? > (Notice: The fewer tasks the higher the theoretical max to the CPU > utialization :-) > In theory it works fine, but there is some drawbacks when you go to > reality: > 1) You have to know the running time of all your tasks. We look at the maximum running time to finish the job. > 2) The running time is assumed contant. Just the maximum. > 3) You don't take into account the time it takes to perform a task switch. We add that on top of the running time. > 4) Mutual exclution isn't considered. We also add a buffer on top of running time for all locks held. > 5) What if things aren't periodic? They are not given a high priority. > > To make this work according to the theory you have to make a really, > really detailled analysis of your system. It is not really posible for > more than a few tasks. In practise you will have to be very much below the > theoretical limit for CPU utialization to be safe. I just write the code, my customers are the expert on their system :-) But I do know that we do not go to the limit of CPU utilization. There is always a buffer on all RT projects I've been on. > > And my claim is: > All the different periodic jobs you want to perform can most often easily > be put into groups like 1ms, 5ms, 20ms, 100ms groups. In each group you > can easily run with fifo preemption - and even within one OS thread. > > Think about it: If you have two tasks with close to the same period and > one runs too long for the other to meets it's deadlines within fifo setup, > you are bound to be very close to the limits no matter which one you give > the highest priority. Sounds reasonable. We do do something similar. But we still use different priorities with threads with the same period. It's when the real word priority comes in, and you forget the math. We really do want that nuclear power plant running even if the washing machine should run! > > > Doug also mentioned that you really want to have every task with a > > different priority, so it makes sense to have a lot of priorities. I > > can't remember why he said this, but I'm sure you and I can find out by > > searching through his papers. > > > > On the contrary. Especially if you have mutual exclustion. If you run two > tasks at the same priority with fifo you don't hit congestion on you > mutexes (between those two tasks at least). If you give one of them just > one higher priority it will preempt the other. You thus risk congestion - > which is an expensive thing. Thus by giving your tasks different > priorities you risk that you system can't meet the deadlines due to the > extra CPU used!! I wont argue here. I let my customers analyze the system and I just write the code for them. I only make suggestions, and just do what I'm told. You brought up some interesting points, and if I ever get time, I will need to write some utilities to show the pros and cons of each approach. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 17:25 ` Steven Rostedt 2005-07-27 21:32 ` Esben Nielsen @ 2005-07-28 7:22 ` Ingo Molnar 2005-07-28 11:53 ` Steven Rostedt 1 sibling, 1 reply; 50+ messages in thread From: Ingo Molnar @ 2005-07-28 7:22 UTC (permalink / raw) To: Steven Rostedt Cc: Esben Nielsen, K.R. Foley, LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > A colleague of mine, well actually the VP of my company of the time, > Doug Locke, gave me a perfect example. If you have a program that > runs a nuclear power plant that needs to wake up and run 4 seconds > every 10 seconds, and on that same computer you have a program running > a washing machine that needs to wake up every 3 seconds and run for > one second (I'm using seconds just to make the example simple). Which > process gets the higher priority? The answer is the washing machine. > > Rational: If the power plant was higher priority, the washing machine > would fail almost every time, since the power plant program would run > for 4 seconds, and since the cycle of the washing machine is 3 > seconds, it would fail everytime the nuclear power plant program ran. > Now if you have the washing machine run in it's cycle, the nuclear > power plant can easily make the 4 seconds ever 10 seconds, even when > it is interrupted by the washing machine. nitpicking: i guess the answer also depends on what the precise requirement is. If the requirement is 'run for 4 seconds every 10 seconds, uninterrupted, else the power plant melts down', i'd sure not make the washing machine process the higher priority one ;-) (also, i'd give the power plant process higher priority even if the requirement is not as strict, just from a risk POV: what if the washing machine control program is buggy and got into an infinite loop somewhere.) Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 7:22 ` Ingo Molnar @ 2005-07-28 11:53 ` Steven Rostedt 0 siblings, 0 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 11:53 UTC (permalink / raw) To: Ingo Molnar Cc: Esben Nielsen, K.R. Foley, LKML, Linus Torvalds, Andrew Morton On Thu, 2005-07-28 at 09:22 +0200, Ingo Molnar wrote: > nitpicking: i guess the answer also depends on what the precise > requirement is. If the requirement is 'run for 4 seconds every 10 > seconds, uninterrupted, else the power plant melts down', i'd sure not > make the washing machine process the higher priority one ;-) > > (also, i'd give the power plant process higher priority even if the > requirement is not as strict, just from a risk POV: what if the washing > machine control program is buggy and got into an infinite loop > somewhere.) Doug also said that you're an idiot if you run a washing machine from the same computer you run a nuclear power plant from :-) The point that he was making though is that if you want a system that runs without flaws, you don't always prioritize the same way the real world would prioritize. You need to do it with math. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 17:01 ` Esben Nielsen 2005-07-27 17:25 ` Steven Rostedt @ 2005-07-27 17:42 ` K.R. Foley 2005-07-28 9:59 ` Esben Nielsen 1 sibling, 1 reply; 50+ messages in thread From: K.R. Foley @ 2005-07-27 17:42 UTC (permalink / raw) To: Esben Nielsen Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton Esben Nielsen wrote: > On Wed, 27 Jul 2005, K.R. Foley wrote: > > >>Esben Nielsen wrote: >> >>>On Wed, 27 Jul 2005, Ingo Molnar wrote: >>> >>> >>> >>>>* Steven Rostedt <rostedt@goodmis.org> wrote: >>>> >>>> >>>> >>>>>Perfectly understood. I've had two customers ask me to increase the >>>>>priorities for them, but those where custom kernels, and a config >>>>>option wasn't necessary. But since I've had customers asking, I >>>>>thought that this might be something that others want. But I deal >>>>>with a niche market, and what my customers want might not be what >>>>>everyone wants. (hence the RFC in the subject). >>>>> >>>>>So if there are others out there that would prefer to change their >>>>>priority ranges, speak now otherwise this patch will go by the waste >>>>>side. >>>> >>>>i'm not excluding that this will become necessary in the future. We >>>>should also add the safety check to sched.h - all i'm suggesting is to >>>>not make it a .config option just now, because that tends to be fiddled >>>>with. >>>> >>> >>>Isn't there a way to mark it "warning! warning! dangerous!" ? >>> >>>Anyway: I think 100 RT priorities is way overkill - and slowing things >>>down by making the scheduler checking more empty slots in the runqueue. >>>Default ought to be 10. In practise it will be very hard to have >>>a task at the lower RT priority behaving real-time with 99 higher >>>priority tasks around. I find it hard to believe that somebody has an RT >>>app needing more than 10 priorities and can't do with RR or FIFO >>>scheduling within a fewer number of prorities. >>> >>>Esben >>> >> >>Actually, is it really that slow to search a bitmap for a slot that >>needs processing? > > No, it is ultra fast - but done very often. > :-) I'm not going to argue this. You could definitely save a few instructions if you had fewer priorities and it probably wouldn't be very hard to create a patch to do this. > >>I work on real-time test stands which are less of an >>embedded system and more of a real Unix system that require determinism. >>It is very nice in some cases to have more than 10 RT priorities to work >>with. > > > What for? Why can't you use FIFO at the same priorities for some of your > tasks? I pretty much quess you have a very few tasks which have some high > requirements. The rest of you "RT" task could easily share the lowest RT > priority. FIFO would also be more effective as you will have context > switches. > > This about multiple priorities probably comes from an ordering of tasks: > You have a lot of task. You have a feeling about which one ought to be > more important than the other. Thus you end of with an ordered list of > tasks. BUT when you boil it down to what RT is all about, namely > meeting your deadlines, it doesn't matter after the 5-10 priorities > because the 5-10 priorities have introduced a lot of jitter to the rest > of the tasks anyway. You can just as well just put them at the same > priority. > > Esben All of the RT priorities that we have are not absolutely necessary. As I think Steven pointed out in another email, it is nice though to be able to priortize tasks using large jumps in priorities and then being able to fill in tasks that are dependent on other tasks in between. Even if you think of nothing but the IRQ handlers, the 5-10 priorities quickly get crowded without any user tasks. -- kr ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 17:42 ` K.R. Foley @ 2005-07-28 9:59 ` Esben Nielsen 0 siblings, 0 replies; 50+ messages in thread From: Esben Nielsen @ 2005-07-28 9:59 UTC (permalink / raw) To: K.R. Foley Cc: Ingo Molnar, Steven Rostedt, LKML, Linus Torvalds, Andrew Morton On Wed, 27 Jul 2005, K.R. Foley wrote: > Esben Nielsen wrote: > [...] > > All of the RT priorities that we have are not absolutely necessary. As I > think Steven pointed out in another email, it is nice though to be able > to priortize tasks using large jumps in priorities and then being able > to fill in tasks that are dependent on other tasks in between. For portability you shouldn't hardcode your priorities anyway. You need some sort of abstraction layer. It should be very simple to code one such you at least can avoid having gaps in your priorities. Making it figure out who can share priorities is probably harder. Under all circumstances: You are stressing the system run-time because you didn't do a proper job compile and boot time. > Even if > you think of nothing but the IRQ handlers, the 5-10 priorities quickly > get crowded without any user tasks. Why do the irq-handlers need _different_ priorities? Do they really have to preempt each other? It is likely that the longest handler runs longer than the accepted latency for the most critical handler. Thus the most critical handler has to preempt that one. But I bet you don't have a system with 10 interrupts source where handler 1 needs to preempt handler 2, handler 2 needs to preempt handler 3 etc. At most you have a system where handlers 1-3 need to preempt handlers 4-10 (and the application) handler 4 and 5 need to preempt handler 6-10, while handlers 6-10 don't need to preempt any of the other handlers. In that case you only need 3 priorities, not 5-10. (The highest priority can very well be hard-irq context :-). Then add 2 RT priorities for your application. You need one thread to handle the data from the high priority interrupt and 1 for the middle interrupt. You thus have 5 RT priorities 1: handlers 1-3 (can be in hard irq) 2: application thread 1 3: handlers 4 and 5 4: application thread 2 5: the rest of the irq handlers Even as your application grows big it doesn't help throwing in more priorities. If a task runs for very long and have a low latency limit you will have a hard time no matter what you do. If all the low latency stuff runs sufficient fast it can just as well run with FIFO priority wrt. each other. And even if you split the stuff up in seperate OS threads giving them the same RT priority and FIFO policy will make things run faster due to fewer task switches. Preemption is expensive. Even though you want it, you should design your system such it doesn't happen too often. Esben > > > -- > kr > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:17 ` Ingo Molnar 2005-07-27 14:26 ` Steven Rostedt @ 2005-07-27 14:28 ` Steven Rostedt 2005-07-27 14:38 ` Ingo Molnar 2005-07-28 1:42 ` Matt Mackall 2 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 14:28 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 16:17 +0200, Ingo Molnar wrote: > i'd not do this patch, mainly because the '100 priority levels' thing is > pretty much an assumption in lots of userspace code. I must argue though, any user app that assumes 100 is the max prio is already broken. That's why there are system calls to get the actual range. Maybe it would be good to change the range to find the apps that break. And then fix them. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:28 ` Steven Rostedt @ 2005-07-27 14:38 ` Ingo Molnar 2005-07-27 14:46 ` Steven Rostedt 0 siblings, 1 reply; 50+ messages in thread From: Ingo Molnar @ 2005-07-27 14:38 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 2005-07-27 at 16:17 +0200, Ingo Molnar wrote: > > i'd not do this patch, mainly because the '100 priority levels' thing is > > pretty much an assumption in lots of userspace code. > > I must argue though, any user app that assumes 100 is the max prio is > already broken. That's why there are system calls to get the actual > range. Maybe it would be good to change the range to find the apps > that break. And then fix them. a fair number of apps assume that there's _at least_ 100 levels of priorities. The moment you have a custom kernel that offers more than 100 priorities, there will be apps that assume that there are more than 100 priority levels, and will break if there are less. Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:38 ` Ingo Molnar @ 2005-07-27 14:46 ` Steven Rostedt 2005-07-28 7:33 ` Ingo Molnar 0 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-27 14:46 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML, Linus Torvalds, Andrew Morton On Wed, 2005-07-27 at 16:38 +0200, Ingo Molnar wrote: > a fair number of apps assume that there's _at least_ 100 levels of > priorities. The moment you have a custom kernel that offers more than > 100 priorities, there will be apps that assume that there are more than > 100 priority levels, and will break if there are less. That's not the same. The apps on my computer right now should not assume that I have 100 priorities. Since I'm free to work with any kernel that I want. The customers that ask for a custom kernel are also writing custom apps for a specific product. Things can be assumed more since the apps are not generic tools that are running on desktops. In fact, some of these apps require special hardware to work (at least a driver to imitate the hardware). -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:46 ` Steven Rostedt @ 2005-07-28 7:33 ` Ingo Molnar 0 siblings, 0 replies; 50+ messages in thread From: Ingo Molnar @ 2005-07-28 7:33 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton * Steven Rostedt <rostedt@goodmis.org> wrote: > > a fair number of apps assume that there's _at least_ 100 levels of > > priorities. The moment you have a custom kernel that offers more than > > 100 priorities, there will be apps that assume that there are more than > > 100 priority levels, and will break if there are less. > > That's not the same. The apps on my computer right now should not > assume that I have 100 priorities. Since I'm free to work with any > kernel that I want. The customers that ask for a custom kernel are > also writing custom apps for a specific product. Things can be > assumed more since the apps are not generic tools that are running on > desktops. In fact, some of these apps require special hardware to > work (at least a driver to imitate the hardware). what i mean is that once there's such a .config option in the generic kernel, the cat is out of the bag and there's no way back. In other words, increasing the number of priorities is an irreversible change, in terms of binary compatibility. there's absolutely no problem with tweaking your kernel for special purposes, and i support all the patches to make it easier (as long as there's no cost to the kernel - and there's no such cost so far). But whenever some change in the generic kernel has a user-visible effect, especially one that is pretty much irreversible, we have to be very careful. (I think in the longer term we might want to do something more clever than a blanket increase of the priority levels, we could e.g. put a mapping layer between user priorities and kernel priorities, and in fact we could make 'user-visible' RT priorities a bit more structured, like a hierarchy of levels - while keeping the in-kernel priorities a fast & flat thing (which could have more than 100 levels - but that would not be visible externally). There's some demand for that, but we'll have to wait and see for real demand and real code to crystalise out.) Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:17 ` Ingo Molnar 2005-07-27 14:26 ` Steven Rostedt 2005-07-27 14:28 ` Steven Rostedt @ 2005-07-28 1:42 ` Matt Mackall 2 siblings, 0 replies; 50+ messages in thread From: Matt Mackall @ 2005-07-28 1:42 UTC (permalink / raw) To: Ingo Molnar; +Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton On Wed, Jul 27, 2005 at 04:17:54PM +0200, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > The following patch makes the MAX_RT_PRIO and MAX_USER_RT_PRIO > > configurable from the make *config. This is more of a proposal since > > I'm not really sure where in Kconfig this would best fit. I don't see > > why these options shouldn't be user configurable without going into > > the kernel headers to change them. > > i'd not do this patch, mainly because the '100 priority levels' thing is > pretty much an assumption in lots of userspace code. The patch to make > it easier to redefine it is of course fine and was accepted, but i dont > think we want to make it explicit via .config. > > It's a bit like with the 3:1 split: you can redefine it easily via > include files, but it's not configurable via .config, because many > people would just play with it and would see things break. > > so unless there's really a desire from distributions to actually change > the 100 RT-prio levels (and i dont sense such a desire), we shouldnt do > this. The queues take a fairly substantial amount of memory. I've had an option for configuring this under CONFIG_EMBEDDED in the -tiny tree for quite some time. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-27 14:13 [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Steven Rostedt 2005-07-27 14:17 ` Ingo Molnar @ 2005-07-28 1:00 ` Daniel Walker 2005-07-28 1:20 ` Lee Revell 2005-07-28 1:25 ` Steven Rostedt 1 sibling, 2 replies; 50+ messages in thread From: Daniel Walker @ 2005-07-28 1:00 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar Don't you break sched_find_first_bit() , seems it's dependent on a 140-bit bitmap . Daniel On Wed, 2005-07-27 at 10:13 -0400, Steven Rostedt wrote: > The following patch makes the MAX_RT_PRIO and MAX_USER_RT_PRIO > configurable from the make *config. This is more of a proposal since > I'm not really sure where in Kconfig this would best fit. I don't see > why these options shouldn't be user configurable without going into the > kernel headers to change them. > > Also, is there a way in the Kconfig to force the checking of > MAX_USER_RT_PRIO <= MAX_RT_PRIO? > > -- Steve > > (Patched against 2.6.12.2) > > Index: vanilla_kernel/include/linux/sched.h > =================================================================== > --- vanilla_kernel/include/linux/sched.h (revision 263) > +++ vanilla_kernel/include/linux/sched.h (working copy) > @@ -389,9 +389,13 @@ > * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO. > */ > > -#define MAX_USER_RT_PRIO 100 > -#define MAX_RT_PRIO MAX_USER_RT_PRIO > +#define MAX_USER_RT_PRIO CONFIG_MAX_USER_RT_PRIO > +#define MAX_RT_PRIO CONFIG_MAX_RT_PRIO > > +#if MAX_USER_RT_PRIO > MAX_RT_PRIO > +#error MAX_USER_RT_PRIO must not be greater than MAX_RT_PRIO > +#endif > + > #define MAX_PRIO (MAX_RT_PRIO + 40) > > #define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO)) > Index: vanilla_kernel/init/Kconfig > =================================================================== > --- vanilla_kernel/init/Kconfig (revision 263) > +++ vanilla_kernel/init/Kconfig (working copy) > @@ -162,6 +162,32 @@ > building a kernel for install/rescue disks or your system is very > limited in memory. > > +config MAX_RT_PRIO > + int "Maximum RT priority" > + default 100 > + help > + The real-time priority of threads that have the policy of SCHED_FIFO > + or SCHED_RR have a priority higher than normal threads. This range > + can be set here, where the range starts from 0 to MAX_RT_PRIO-1. > + If this range is higher than MAX_USER_RT_PRIO than kernel threads > + may have a higher priority than any user thread. > + > + This may be the same as MAX_USER_RT_PRIO, but do not set this > + to be less than MAX_USER_RT_PRIO. > + > +config MAX_USER_RT_PRIO > + int "Maximum User RT priority" > + default 100 > + help > + The real-time priority of threads that have the policy of SCHED_FIFO > + or SCHED_RR have a priority higher than normal threads. This range > + can be set here, where the range starts from 0 to MAX_USER_RT_PRIO-1. > + If this range is lower than MAX_RT_PRIO, than kernel threads may have > + a higher priority than any user thread. > + > + This may be the same as MAX_RT_PRIO, but do not set this to be > + greater than MAX_RT_PRIO. > + > config AUDIT > bool "Auditing support" > default y if SECURITY_SELINUX > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 1:00 ` Daniel Walker @ 2005-07-28 1:20 ` Lee Revell 2005-07-28 1:26 ` Steven Rostedt 2005-07-28 1:25 ` Steven Rostedt 1 sibling, 1 reply; 50+ messages in thread From: Lee Revell @ 2005-07-28 1:20 UTC (permalink / raw) To: Daniel Walker Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar On Wed, 2005-07-27 at 18:00 -0700, Daniel Walker wrote: > Don't you break sched_find_first_bit() , seems it's dependent on a > 140-bit bitmap . And doesn't POSIX specify 100 RT priority levels? Lee ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 1:20 ` Lee Revell @ 2005-07-28 1:26 ` Steven Rostedt 0 siblings, 0 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 1:26 UTC (permalink / raw) To: Lee Revell Cc: Daniel Walker, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar On Wed, 2005-07-27 at 21:20 -0400, Lee Revell wrote: > On Wed, 2005-07-27 at 18:00 -0700, Daniel Walker wrote: > > Don't you break sched_find_first_bit() , seems it's dependent on a > > 140-bit bitmap . > > And doesn't POSIX specify 100 RT priority levels? My customers don't care about POSIX :-) -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 1:00 ` Daniel Walker 2005-07-28 1:20 ` Lee Revell @ 2005-07-28 1:25 ` Steven Rostedt 2005-07-28 3:06 ` Steven Rostedt 1 sibling, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 1:25 UTC (permalink / raw) To: Daniel Walker; +Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar On Wed, 2005-07-27 at 18:00 -0700, Daniel Walker wrote: > Don't you break sched_find_first_bit() , seems it's dependent on a > 140-bit bitmap . Oops! I forgot about that. With my custom kernels I had to change this to use the generic find_first_bit routine. It's been a while since I made these changes. So when we really need to have custom settings, we would have to change this. I should have remembered this, since it did cause me couple of days of debugging. Anyway, I never did the measurements, but does anyone know what the performance difference is between find_first_bit and sched_find_first_bit? I guess I'll do it and report back later. This should also be in a comment around changing these settings. Thanks, -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 1:25 ` Steven Rostedt @ 2005-07-28 3:06 ` Steven Rostedt 2005-07-28 3:32 ` Steven Rostedt 0 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 3:06 UTC (permalink / raw) To: Daniel Walker; +Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, LKML [-- Attachment #1: Type: text/plain, Size: 2102 bytes --] On Wed, 2005-07-27 at 21:25 -0400, Steven Rostedt wrote: > On Wed, 2005-07-27 at 18:00 -0700, Daniel Walker wrote: > > Don't you break sched_find_first_bit() , seems it's dependent on a > > 140-bit bitmap . > > Oops! I forgot about that. With my custom kernels I had to change this > to use the generic find_first_bit routine. It's been a while since I > made these changes. So when we really need to have custom settings, we > would have to change this. I should have remembered this, since it did > cause me couple of days of debugging. Anyway, I never did the > measurements, but does anyone know what the performance difference is > between find_first_bit and sched_find_first_bit? I guess I'll do it and > report back later. This should also be in a comment around changing > these settings. OK, here's the benchmark. Attached is the program that I used, and compiled it this way: gcc -O2 -o ffb ffb.c and here's the results: clock speed = 00000000:7f30c931 2133903665 ticks per second last bit set generic ffb: 00000000:02755284 time: 0.019327615us sched ffb: 00000000:001e8f2b time: 0.000938529us first bit set generic ffb: 00000000:01ea41f0 time: 0.015056688us sched ffb: 00000000:001e8eff time: 0.000938509us /proc/cpuinfo shows that I have a SMP Authentic AMD running at 2133.635 MHz. The SMP doesn't matter for this test, but the Hz goes right in line to the above BS tick measure. I ran both the generic ffb and the sched_find_first_bit 1,000,000 times each and took the measurements of both. The sched_find_first_bit ran 20 times faster!!! So that is quite an improvement. Even when I set the first bit, it is 15 times faster. The time between the two for the sched_ffb is pretty much constant, where as the ffb is quicker the closer the bit is (as expected). But we are talking 19ns to do this, and a colleague of mine said that this is just eliminating a fart in the wind storm. But I still think that the ffb should be better for something like the scheduler. Well, I guess I can spend some more time playing with algorithms that can improve the ffb :-) -- Steve [-- Attachment #2: ffb.c --] [-- Type: text/x-csrc, Size: 2650 bytes --] #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #define unlikely(x) __builtin_expect(!!(x), 0) static inline int find_first_bit(const unsigned long *addr, unsigned size) { int d0, d1; int res; /* This looks at memory. Mark it volatile to tell gcc not to move it around */ __asm__ __volatile__( "xorl %%eax,%%eax\n\t" "repe; scasl\n\t" "jz 1f\n\t" "leal -4(%%edi),%%edi\n\t" "bsfl (%%edi),%%eax\n" "1:\tsubl %%ebx,%%edi\n\t" "shll $3,%%edi\n\t" "addl %%edi,%%eax" :"=a" (res), "=&c" (d0), "=&D" (d1) :"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory"); return res; } static inline unsigned long __ffs(unsigned long word) { __asm__("bsfl %1,%0" :"=r" (word) :"rm" (word)); return word; } static inline int sched_find_first_bit(const unsigned long *b) { if (unlikely(b[0])) return __ffs(b[0]); if (unlikely(b[1])) return __ffs(b[1]) + 32; if (unlikely(b[2])) return __ffs(b[2]) + 64; if (b[3]) return __ffs(b[3]) + 96; return __ffs(b[4]) + 128; } #define rdtscll(val) \ __asm__ __volatile__("rdtsc" : "=A" (val)) #define rdtsc(low,high) \ __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high)) static unsigned long array[5]; #define ITER 1000000 /* 1,000,000 times */ void testit(unsigned long *array, unsigned long long clock) { unsigned long long s1; unsigned long long e1; unsigned long long s2; unsigned long long e2; unsigned long long t1; unsigned long long t2; double f1; double f2; int i; int x; rdtscll(s1); for (i=0; i < ITER; i++) x = find_first_bit(array,140); rdtscll(e1); rdtscll(s2); for (i=0; i < ITER; i++) x = sched_find_first_bit(array); rdtscll(e2); t1 = e1 - s1; t2 = e2 - s2; f1 = (float)t1 / (float)clock; f2 = (float)t2 / (float)clock; /* * Since ITER is 1,000,000 the times will be in us. */ printf("generic ffb: %08lx:%08lx\n", (unsigned long)(t1>>32),(unsigned long)t1); printf("time: %.09fus\n",f1); printf("sched ffb: %08lx:%08lx\n", (unsigned long)(t2>>32),(unsigned long)t2); printf("time: %.09fus\n",f2); } int main(int argc, char **argv) { unsigned long long s; unsigned long long e; unsigned long long t; unsigned long long clock; /* * Calculate BS time, just to get an * idea of the tsc speed. */ rdtscll(s); sleep(8); rdtscll(e); t = e - s; t >>= 3; printf("clock speed = %08lx:%08lx %llu ticks per second\n", (unsigned long)(t>>32),(unsigned long)t, t); clock = t; array[4] = 0x80000000; printf("last bit set\n"); testit(array,clock); array[0] = 0x00000001; printf("\nfirst bit set\n"); testit(array,clock); exit(0); } ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 3:06 ` Steven Rostedt @ 2005-07-28 3:32 ` Steven Rostedt 2005-07-28 3:45 ` Steven Rostedt 0 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 3:32 UTC (permalink / raw) To: Daniel Walker; +Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, LKML [-- Attachment #1: Type: text/plain, Size: 2432 bytes --] New benchmarks with my own formula. I added my own find_first_bit algorithm, that still doesn't quite beat Ingo's sched_find_first_bit, but it does a number on the generic find_first_bit. I guess the compiler has finally beat what we do in asm. Here's my generic algorithm. static inline int my_find_first_bit(const unsigned long *b, unsigned size) { int x = 0; size += 31; do { if (unlikely(*b)) return __ffs(*b) + x; b++; x += 32; } while (x < size); return x-32; } And here's the new results of my benchmark: Comments added. /* * I put in what is returned to see what really is returned is * correct. Funny that sched_find_first_bit is wrong if there * isn't a bit set. But this shouldn't be called if there isn't * one set. */ sched=128 ffb=160 my=160 clock speed = 00000000:7f2f2cbb 2133798075 ticks per second /* Here I set the last bit of 5 32bit words. */ sched=159 ffb=159 my=159 last bit set generic ffb: 00000000:026a1cea time: 0.018984294us sched ffb: 00000000:001ee719 time: 0.000949125us /* * My time is 8ns, 8 times longer than Ingo's if the last * bit is set (most likely case), but still more than twice * as fast as ffb. */ my ffb: 00000000:0106a3dd time: 0.008066546us sched=0 ffb=0 my=0 first bit set generic ffb: 00000000:01ee8e2b time: 0.015189432us sched ffb: 00000000:001eea92 time: 0.000949542us /* * With the first bit set, I'm still slower than Ingo's * but only by 2, and I still beat the pants off of ffb. */ my ffb: 00000000:004d5de6 time: 0.002376190us Conclusion: it looks like it might be time to change the i386 generic ffb to something else. Oh, and if you are wondering... $ gcc -v Using built-in specs. Target: i486-linux-gnu Configured with: ../src/configure -v --enable-languages=c,c ++,java,f95,objc,ada,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --enable-nls --without-included-gettext --enable-threads=posix --program-suffix=-4.0 --enable-__cxa_atexit --enable-libstdcxx-allocator=mt --enable-clocale=gnu --enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-4.0-1.4.2.0/jre --enable-mpfr --disable-werror --enable-checking=release i486-linux-gnu Thread model: posix gcc version 4.0.1 (Debian 4.0.1-2) -- Steve Attached is new version of ffb.c [-- Attachment #2: ffb.c --] [-- Type: text/x-csrc, Size: 3385 bytes --] #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #define unlikely(x) __builtin_expect(!!(x), 0) static inline int find_first_bit(const unsigned long *addr, unsigned size) { int d0, d1; int res; /* This looks at memory. Mark it volatile to tell gcc not to move it around */ __asm__ __volatile__( "xorl %%eax,%%eax\n\t" "repe; scasl\n\t" "jz 1f\n\t" "leal -4(%%edi),%%edi\n\t" "bsfl (%%edi),%%eax\n" "1:\tsubl %%ebx,%%edi\n\t" "shll $3,%%edi\n\t" "addl %%edi,%%eax" :"=a" (res), "=&c" (d0), "=&D" (d1) :"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory"); return res; } static inline unsigned long __ffs(unsigned long word) { __asm__("bsfl %1,%0" :"=r" (word) :"rm" (word)); return word; } static inline int sched_find_first_bit(const unsigned long *b) { if (unlikely(b[0])) return __ffs(b[0]); if (unlikely(b[1])) return __ffs(b[1]) + 32; if (unlikely(b[2])) return __ffs(b[2]) + 64; if (b[3]) return __ffs(b[3]) + 96; return __ffs(b[4]) + 128; } static inline int my_find_first_bit(const unsigned long *b, unsigned size) { int x = 0; size += 31; do { if (unlikely(*b)) return __ffs(*b) + x; b++; x += 32; } while (x < size); return x-32; } #define rdtscll(val) \ __asm__ __volatile__("rdtsc" : "=A" (val)) #define rdtsc(low,high) \ __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high)) static unsigned long array[5]; #define ITER 1000000 /* 1,000,000 times */ void testit(unsigned long *array, unsigned long long clock) { unsigned long long s; unsigned long long e; unsigned long long t; double f; int i; int x; /* * Since ITER is 1,000,000 the times will be in us. */ rdtscll(s); for (i=0; i < ITER; i++) x = find_first_bit(array,140); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("generic ffb: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); rdtscll(s); for (i=0; i < ITER; i++) x = sched_find_first_bit(array); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("sched ffb: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); rdtscll(s); for (i=0; i < ITER; i++) x = my_find_first_bit(array,140); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("my ffb: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); } int main(int argc, char **argv) { unsigned long long s; unsigned long long e; unsigned long long t; unsigned long long clock; printf("sched=%d ffb=%d my=%d\n", sched_find_first_bit(array), find_first_bit(array,140), my_find_first_bit(array,140)); /* * Calculate BS time, just to get an * idea of the tsc speed. */ rdtscll(s); sleep(8); rdtscll(e); t = e - s; t >>= 3; printf("clock speed = %08lx:%08lx %llu ticks per second\n", (unsigned long)(t>>32),(unsigned long)t, t); clock = t; array[4] = 0x80000000; printf("sched=%d ffb=%d my=%d\n", sched_find_first_bit(array), find_first_bit(array,140), my_find_first_bit(array,140)); printf("last bit set\n"); testit(array,clock); array[0] = 0x00000001; printf("sched=%d ffb=%d my=%d\n", sched_find_first_bit(array), find_first_bit(array,140), my_find_first_bit(array,140)); printf("\nfirst bit set\n"); testit(array,clock); exit(0); } ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 3:32 ` Steven Rostedt @ 2005-07-28 3:45 ` Steven Rostedt 2005-07-28 3:51 ` Nick Piggin 0 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 3:45 UTC (permalink / raw) To: Daniel Walker; +Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar On Wed, 2005-07-27 at 23:32 -0400, Steven Rostedt wrote: > New benchmarks with my own formula. > Take two: I ran this on my laptop (IBM ThinkPad G41 with an 3.3GHz Pentium 4 HT processor, yeah yeah, no battery life :-). But this has an older compiler. $ gcc -v Reading specs from /usr/lib/gcc-lib/i486-linux-gnu/3.3.6/specs Configured with: ../src/configure -v --enable-languages=c,c ++,java,f77,pascal,objc,ada,treelang --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared --enable-__cxa_atexit --with-system-zlib --enable-nls --without-included-gettext --enable-clocale=gnu --enable-debug --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc i486-linux-gnu Thread model: posix gcc version 3.3.6 (Debian 1:3.3.6-7) Here's the results on it: compiled again with "gcc -O2 -o ffb ffb.c" /* comments embedded */ sched=128 ffb=160 my=160 /* this is a 3.3GHz so clocks look good */ clock speed = 00000000:c5e54f4c 3320139596 ticks per second sched=159 ffb=159 my=159 last bit set generic ffb: 00000000:0a25b1d0 /* Hmm, all generic ffb is even slower? do I blame gcc or Intel? */ time: 0.051275712us sched ffb: 00000000:0023ae21 /* double Hmm, Ingo's is faster !? */ time: 0.000704289us my ffb: 00000000:019c458d /* look at this, mine is the same. Are the three bears coming around soon :-) */ time: 0.008137802us sched=0 ffb=0 my=0 first bit set generic ffb: 00000000:09766c3c time: 0.047816034us sched ffb: 00000000:0023c79f time: 0.000706254us my ffb: 00000000:005a8758 time: 0.001786939us OK, still looks like the generic ffb can be changed. Unless I'm missing something, this shows that I probably should be sending in a patch now to replace the find_first_bit. Ingo's sched_find_first_bit is still the winner, but that is customed to the scheduler, until we need a configurable priority. What do you think? -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable 2005-07-28 3:45 ` Steven Rostedt @ 2005-07-28 3:51 ` Nick Piggin 2005-07-28 11:43 ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt 0 siblings, 1 reply; 50+ messages in thread From: Nick Piggin @ 2005-07-28 3:51 UTC (permalink / raw) To: Steven Rostedt Cc: Daniel Walker, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar Steven Rostedt wrote: >OK, still looks like the generic ffb can be changed. Unless I'm missing >something, this shows that I probably should be sending in a patch now >to replace the find_first_bit. Ingo's sched_find_first_bit is still the >winner, but that is customed to the scheduler, until we need a >configurable priority. > >What do you think? > > > Sure, if it is correct and faster. Just don't mark the (*b) test as unlikely. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 3:51 ` Nick Piggin @ 2005-07-28 11:43 ` Steven Rostedt 2005-07-28 12:45 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 11:43 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, LKML, Daniel Walker [-- Attachment #1: Type: text/plain, Size: 5133 bytes --] In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable" I discovered that a C version of find_first_bit is faster than the asm version now when compiled against gcc 3.3.6 and gcc 4.0.1 (both from versions of Debian unstable). I wrote a benchmark (attached) that runs the code 1,000,000 times. First I do it with no bits set, followed by the last bit set, a middle bit set and then the first bit set. Only when no bits are set is the asm version of find_first_bit faster and that is only when I ran it with gcc 4.0.1 (is gcc at fault here?). I haven't spent any time actually looking at what gcc produces. I only looked at the measurements. I compiled this with "gcc -O2 -o ffb ffb.c". And here's the output. On an AMD 2.2GHz SMP machine (SMP shouldn't affect the result here). This is running gcc 4.0.1 /* comments embedded */ /* ticks match speed */ clock speed = 00000000:7f31d02a 2133970986 ticks per second /* generic ffb (or just ffb) is the code that is currently in the system * my ffb (or just my) is my new version that I'm submitting. * my ffb 2 (or just my2) is my version playing with unlikely around an * condition. */ no bit set ffb=320 my=320 my2=320 generic ffb: 00000000:02e33f90 time: 0.022702922us my ffb: 00000000:02ee66e7 time: 0.023045461us my ffb 2: 00000000:032d63b9 time: 0.024979860us /* * The above shows that the original beats my version when no bit is * set. */ last bit set ffb=319 my=319 my2=319 generic ffb: 00000000:0382a116 time: 0.027597643us my ffb: 00000000:0204c4a9 time: 0.015870375us my ffb 2: 00000000:03244a1b time: 0.024700391us /* * Here we see that there's quite an improvement over normal ffb when * the last bit is set. */ middle bit set ffb=159 my=159 my2=159 generic ffb: 00000000:02ce2b78 time: 0.022055584us my ffb: 00000000:01241c5b time: 0.008970962us my ffb 2: 00000000:016171ff time: 0.010854596us /* * Again, there's quite an improvement when a middle bit is set. */ first bit set ffb=0 my=0 my2=0 generic ffb: 00000000:0232456a time: 0.017267808us my ffb: 00000000:003dd354 time: 0.001898712us my ffb 2: 00000000:009d1f74 time: 0.004825372us /* * When the first bit is set, there's ever a greater improvement. */ Now for the results on my laptop with a Pentium 4 HT 3.3GZ. Running gcc 3.3.6 clock speed = 00000000:c5de80ef 3319693551 ticks per second no bit set ffb=320 my=320 my2=320 generic ffb: 00000000:0aba64db time: 0.054218162us my ffb: 00000000:055f6c73 time: 0.027153036us my ffb 2: 00000000:052e753e time: 0.026186379us /* * Now we see even when no bits are set, my version beats the asm one. */ last bit set ffb=319 my=319 my2=319 generic ffb: 00000000:0b69c638 time: 0.057680447us my ffb: 00000000:050a27fb time: 0.025469722us my ffb 2: 00000000:04d32d78 time: 0.024384359us middle bit set ffb=159 my=159 my2=159 generic ffb: 00000000:0a1bc81f time: 0.051086903us my ffb: 00000000:020f3d7d time: 0.010408554us my ffb 2: 00000000:0324112a time: 0.015873555us first bit set ffb=0 my=0 my2=0 generic ffb: 00000000:095a794d time: 0.047270700us my ffb: 00000000:005af2d0 time: 0.001795467us my ffb 2: 00000000:005a0537 time: 0.001777144us With this evidence, I present my patch against the 2.6.12.2 kernel. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Index: vanilla_kernel/include/asm-i386/bitops.h =================================================================== --- vanilla_kernel/include/asm-i386/bitops.h (revision 263) +++ vanilla_kernel/include/asm-i386/bitops.h (working copy) @@ -311,6 +311,20 @@ int find_next_zero_bit(const unsigned long *addr, int size, int offset); /** + * __ffs - find first bit in word. + * @word: The word to search + * + * Undefined if no bit exists, so code should check against 0 first. + */ +static inline unsigned long __ffs(unsigned long word) +{ + __asm__("bsfl %1,%0" + :"=r" (word) + :"rm" (word)); + return word; +} + +/** * find_first_bit - find the first set bit in a memory region * @addr: The address to start the search at * @size: The maximum size to search @@ -320,22 +334,16 @@ */ static inline int find_first_bit(const unsigned long *addr, unsigned size) { - int d0, d1; - int res; - - /* This looks at memory. Mark it volatile to tell gcc not to move it around */ - __asm__ __volatile__( - "xorl %%eax,%%eax\n\t" - "repe; scasl\n\t" - "jz 1f\n\t" - "leal -4(%%edi),%%edi\n\t" - "bsfl (%%edi),%%eax\n" - "1:\tsubl %%ebx,%%edi\n\t" - "shll $3,%%edi\n\t" - "addl %%edi,%%eax" - :"=a" (res), "=&c" (d0), "=&D" (d1) - :"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory"); - return res; + int x = 0; + do { + if (*addr) + return __ffs(*addr) + x; + addr++; + if (x >= size) + break; + x += 32; + } while (1); + return x; } /** @@ -360,20 +368,6 @@ return word; } -/** - * __ffs - find first bit in word. - * @word: The word to search - * - * Undefined if no bit exists, so code should check against 0 first. - */ -static inline unsigned long __ffs(unsigned long word) -{ - __asm__("bsfl %1,%0" - :"=r" (word) - :"rm" (word)); - return word; -} - /* * fls: find last bit set. */ [-- Attachment #2: ffb.c --] [-- Type: text/x-csrc, Size: 3303 bytes --] #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #define unlikely(x) __builtin_expect(!!(x), 0) static inline int find_first_bit(const unsigned long *addr, unsigned size) { int d0, d1; int res; /* This looks at memory. Mark it volatile to tell gcc not to move it around */ __asm__ __volatile__( "xorl %%eax,%%eax\n\t" "repe; scasl\n\t" "jz 1f\n\t" "leal -4(%%edi),%%edi\n\t" "bsfl (%%edi),%%eax\n" "1:\tsubl %%ebx,%%edi\n\t" "shll $3,%%edi\n\t" "addl %%edi,%%eax" :"=a" (res), "=&c" (d0), "=&D" (d1) :"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory"); return res; } static inline unsigned long __ffs(unsigned long word) { __asm__("bsfl %1,%0" :"=r" (word) :"rm" (word)); return word; } static inline int my_find_first_bit(const unsigned long *b, unsigned size) { int x = 0; do { if (*b) return __ffs(*b) + x; b++; if (x >= size) break; x += 32; } while (1); return x; } static inline int my_find_first_bit2(const unsigned long *b, unsigned size) { int x = 0; do { if (unlikely(*b)) return __ffs(*b) + x; b++; if (x >= size) break; x += 32; } while (1); return x; } #define rdtscll(val) \ __asm__ __volatile__("rdtsc" : "=A" (val)) #define rdtsc(low,high) \ __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high)) #define BITSIZE 310 static unsigned long array[((BITSIZE)>>5)+1]; #define ITER 1000000 /* 1,000,000 times */ void testit(unsigned long *array, unsigned long long clock) { unsigned long long s; unsigned long long e; unsigned long long t; double f; int i; int x; /* * Since ITER is 1,000,000 the times will be in us. */ /* * Make sure that the output is correct. */ printf("ffb=%d my=%d my2=%d\n", find_first_bit(array,BITSIZE), my_find_first_bit(array,BITSIZE), my_find_first_bit2(array,BITSIZE)); rdtscll(s); for (i=0; i < ITER; i++) x = find_first_bit(array,BITSIZE); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("generic ffb: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); rdtscll(s); for (i=0; i < ITER; i++) x = my_find_first_bit(array,BITSIZE); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("my ffb: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); rdtscll(s); for (i=0; i < ITER; i++) x = my_find_first_bit2(array,BITSIZE); rdtscll(e); t = e - s; f = (float)t / (float)clock; printf("my ffb 2: %08lx:%08lx\n", (unsigned long)(t>>32),(unsigned long)t); printf("time: %.09fus\n",f); } int main(int argc, char **argv) { unsigned long long s; unsigned long long e; unsigned long long t; unsigned long long clock; /* * Calculate BS time, just to get an * idea of the tsc speed. */ rdtscll(s); sleep(8); rdtscll(e); t = e - s; t >>= 3; printf("clock speed = %08lx:%08lx %llu ticks per second\n", (unsigned long)(t>>32),(unsigned long)t, t); clock = t; printf("\nno bit set\n"); testit(array,clock); array[BITSIZE>>5] = 0x80000000; printf("\nlast bit set\n"); testit(array,clock); array[4] = 0x80000000; printf("\nmiddle bit set\n"); testit(array,clock); array[0] = 0x00000001; printf("\nfirst bit set\n"); testit(array,clock); exit(0); } ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 11:43 ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt @ 2005-07-28 12:45 ` Steven Rostedt 2005-07-28 15:31 ` Linus Torvalds 2005-07-28 15:30 ` Linus Torvalds 2005-07-28 17:52 ` Mitchell Blank Jr 2 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 12:45 UTC (permalink / raw) To: Nick Piggin Cc: Daniel Walker, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar [snip] > static inline int find_first_bit(const unsigned long *addr, unsigned size) > { [snip] > + int x = 0; > + do { > + if (*addr) > + return __ffs(*addr) + x; > + addr++; > + if (x >= size) > + break; > + x += 32; The 32 looks like it may be problamatic. Is there any i386 64 bit machines. Or is hard coding 32 OK? > + } while (1); > + return x; > } > Just in case, I've updated the patch to use (sizeof(*addr)<<3) Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Index: vanilla_kernel/include/asm-i386/bitops.h =================================================================== --- vanilla_kernel/include/asm-i386/bitops.h (revision 263) +++ vanilla_kernel/include/asm-i386/bitops.h (working copy) @@ -311,6 +311,20 @@ int find_next_zero_bit(const unsigned long *addr, int size, int offset); /** + * __ffs - find first bit in word. + * @word: The word to search + * + * Undefined if no bit exists, so code should check against 0 first. + */ +static inline unsigned long __ffs(unsigned long word) +{ + __asm__("bsfl %1,%0" + :"=r" (word) + :"rm" (word)); + return word; +} + +/** * find_first_bit - find the first set bit in a memory region * @addr: The address to start the search at * @size: The maximum size to search @@ -320,22 +334,16 @@ */ static inline int find_first_bit(const unsigned long *addr, unsigned size) { - int d0, d1; - int res; - - /* This looks at memory. Mark it volatile to tell gcc not to move it around */ - __asm__ __volatile__( - "xorl %%eax,%%eax\n\t" - "repe; scasl\n\t" - "jz 1f\n\t" - "leal -4(%%edi),%%edi\n\t" - "bsfl (%%edi),%%eax\n" - "1:\tsubl %%ebx,%%edi\n\t" - "shll $3,%%edi\n\t" - "addl %%edi,%%eax" - :"=a" (res), "=&c" (d0), "=&D" (d1) - :"1" ((size + 31) >> 5), "2" (addr), "b" (addr) : "memory"); - return res; + int x = 0; + do { + if (*addr) + return __ffs(*addr) + x; + addr++; + if (x >= size) + break; + x += (sizeof(*addr)<<3); + } while (1); + return x; } /** @@ -360,20 +368,6 @@ return word; } -/** - * __ffs - find first bit in word. - * @word: The word to search - * - * Undefined if no bit exists, so code should check against 0 first. - */ -static inline unsigned long __ffs(unsigned long word) -{ - __asm__("bsfl %1,%0" - :"=r" (word) - :"rm" (word)); - return word; -} - /* * fls: find last bit set. */ ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 12:45 ` Steven Rostedt @ 2005-07-28 15:31 ` Linus Torvalds 0 siblings, 0 replies; 50+ messages in thread From: Linus Torvalds @ 2005-07-28 15:31 UTC (permalink / raw) To: Steven Rostedt Cc: Nick Piggin, Daniel Walker, LKML, Andrew Morton, Ingo Molnar On Thu, 28 Jul 2005, Steven Rostedt wrote: > > The 32 looks like it may be problamatic. Is there any i386 64 bit > machines. Or is hard coding 32 OK? We have BITS_PER_LONG exactly for this usage, but the sizeof also works. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 11:43 ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt 2005-07-28 12:45 ` Steven Rostedt @ 2005-07-28 15:30 ` Linus Torvalds 2005-07-28 15:47 ` Steven Rostedt 2005-07-28 17:52 ` Mitchell Blank Jr 2 siblings, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2005-07-28 15:30 UTC (permalink / raw) To: Steven Rostedt Cc: Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Steven Rostedt wrote: > > In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO > configurable" I discovered that a C version of find_first_bit is faster > than the asm version now when compiled against gcc 3.3.6 and gcc 4.0.1 > (both from versions of Debian unstable). I wrote a benchmark (attached) > that runs the code 1,000,000 times. I suspect the old "rep scas" has always been slower than compiler-generated code, at least under your test conditions. Many of the old asm's are actually _very_ old, and some of them come from pre-0.01 days and are more about me learning the i386 (and gcc inline asm). That said, I don't much like your benchmarking methodology. I suspect that quite often, the code in question runs from L2 cache, not in a tight loop, and so that "run a million times" approach is not necessarily the best one. I'll apply this one as obvious: I doubt the compiler generates bigger code or has any real downsides, but I just wanted to say that in general I just wish people didn't always time the hot-cache case ;) Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 15:30 ` Linus Torvalds @ 2005-07-28 15:47 ` Steven Rostedt 2005-07-28 16:34 ` Maciej W. Rozycki 0 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 15:47 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 2005-07-28 at 08:30 -0700, Linus Torvalds wrote: > > I suspect the old "rep scas" has always been slower than > compiler-generated code, at least under your test conditions. Many of the > old asm's are actually _very_ old, and some of them come from pre-0.01 > days and are more about me learning the i386 (and gcc inline asm). I've been playing with different approaches, (still all hot cache though), and inspecting the generated code. It's not that the gcc generated code is always better for the normal case. But since it sees more and everything is not hidden in asm, it can optimise what is being used, and how it's used. > > That said, I don't much like your benchmarking methodology. I suspect that > quite often, the code in question runs from L2 cache, not in a tight loop, > and so that "run a million times" approach is not necessarily the best > one. Well, I never said I was a test benchmark writer :-). If you know of a better way to benchmark these, then let me know. I also thought that having all in a hot cache could help with showing the differences. But I guess I would need to test this in other ways. > > I'll apply this one as obvious: I doubt the compiler generates bigger code > or has any real downsides, but I just wanted to say that in general I just > wish people didn't always time the hot-cache case ;) I've just finished a version of find_first_zero_bit too. It has the same comparisons as the find_first_bit but not as drastic. Do you want this too, and if so, as a separate patch on top of the first one, or against 2.6.12.2 (that's the kernel I'm working with right now) or do you want me to submit a new patch with both changes? -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 15:47 ` Steven Rostedt @ 2005-07-28 16:34 ` Maciej W. Rozycki 2005-07-28 16:57 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread From: Maciej W. Rozycki @ 2005-07-28 16:34 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Steven Rostedt wrote: > I've been playing with different approaches, (still all hot cache > though), and inspecting the generated code. It's not that the gcc > generated code is always better for the normal case. But since it sees > more and everything is not hidden in asm, it can optimise what is being > used, and how it's used. Since you're considering GCC-generated code for ffs(), ffz() and friends, how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as apropriate? Reasonably recent GCC may actually be good enough to use the fastest code depending on the processor submodel selected. Maciej ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 16:34 ` Maciej W. Rozycki @ 2005-07-28 16:57 ` Steven Rostedt 2005-07-28 17:25 ` Linus Torvalds 2005-07-28 17:17 ` Linus Torvalds 2005-07-28 18:25 ` Steven Rostedt 2 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 16:57 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Linus Torvalds, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 2005-07-28 at 17:34 +0100, Maciej W. Rozycki wrote: > Since you're considering GCC-generated code for ffs(), ffz() and friends, > how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as > apropriate? Reasonably recent GCC may actually be good enough to use the > fastest code depending on the processor submodel selected. I can change the find_first_bit to use __builtin_ffs, but how would you implement the ffz? The clz and ctz only count the number of leading or trailing zeros respectively, it doesn't find the first zero. Of course a __builtin_ctz(~x) would but this might take longer than what we already have. I'll go ahead and try it and see. But I still don't have a decent benchmark on this. I'll start looking into the kernel and see how it's used, and see if I can find a proper benchmark. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 16:57 ` Steven Rostedt @ 2005-07-28 17:25 ` Linus Torvalds 2005-07-29 10:03 ` David Woodhouse 2005-07-29 14:39 ` Maciej W. Rozycki 0 siblings, 2 replies; 50+ messages in thread From: Linus Torvalds @ 2005-07-28 17:25 UTC (permalink / raw) To: Steven Rostedt Cc: Maciej W. Rozycki, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Steven Rostedt wrote: > > I can change the find_first_bit to use __builtin_ffs, but how would you > implement the ffz? The thing is, there are basically _zero_ upsides to using the __builtin_xx functions on x86. There may be more upsides on other architectures (*cough*ia64*cough*) that have strange scheduling issues and other complexities, but on x86 in particular, the __builtin_xxx() functions tend to be a lot more pain than they are worth. Not only do they have strange limitations (on selection of opcodes but also for compiler versions), but they aren't well documented, and semantics aren't clear. For example, if you use the "bsfl" inline assembly instruction, you know what the semantics are and what the generated code is like: Intel documents it, and you know what code you generated. So the special cases like "what happens if the input is zero" are well-defined. In contrast, the gcc builtins probably match some standard that is not only harder to find, but also has some _other_ definition for what happens for the zero case, so the builtins automatically end up having problems due to semantic mis-match between the CPU and the standard. Basic rule: inline assembly is _better_ than random compiler extensions. It's better to have _one_ well-documented extension that is very generic than it is to have a thousand specialized extensions. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 17:25 ` Linus Torvalds @ 2005-07-29 10:03 ` David Woodhouse 2005-07-29 14:41 ` Maciej W. Rozycki 2005-07-29 16:23 ` Linus Torvalds 2005-07-29 14:39 ` Maciej W. Rozycki 1 sibling, 2 replies; 50+ messages in thread From: David Woodhouse @ 2005-07-29 10:03 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Maciej W. Rozycki, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 2005-07-28 at 10:25 -0700, Linus Torvalds wrote: > Basic rule: inline assembly is _better_ than random compiler extensions. > It's better to have _one_ well-documented extension that is very generic > than it is to have a thousand specialized extensions. Counterexample: FR-V and its __builtin_read8() et al. For FR-V you have to issue a memory barrier before or after certain I/O instructions, but in some circumstances you can omit them. The compiler knows this and can omit the membar instructions as appropriate -- but doing the same optimisations in inline assembly would be fairly much impossible. Builtins can also allow the compiler more visibility into what's going on and more opportunity to optimise. They can also set condition registers, which you can't do from inline assembly -- if you want to perform a test in inline asm, you have to put the result in a register and then test the contents of that register. (You can't just branch from the inline asm either, although we used to try). Builtins are more portable and their implementation will improve to match developments in the target CPU. Inline assembly, as we have seen, remains the same for years while the technology moves on. Although it's often the case that inline assembly _is_ better, especially in code which is arch-specific in the first place, I wouldn't necessarily assume that it's always the case. -- dwmw2 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-29 10:03 ` David Woodhouse @ 2005-07-29 14:41 ` Maciej W. Rozycki 2005-07-29 16:23 ` Linus Torvalds 1 sibling, 0 replies; 50+ messages in thread From: Maciej W. Rozycki @ 2005-07-29 14:41 UTC (permalink / raw) To: David Woodhouse Cc: Linus Torvalds, Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Fri, 29 Jul 2005, David Woodhouse wrote: > Builtins are more portable and their implementation will improve to > match developments in the target CPU. Inline assembly, as we have seen, > remains the same for years while the technology moves on. > > Although it's often the case that inline assembly _is_ better, > especially in code which is arch-specific in the first place, I wouldn't > necessarily assume that it's always the case. Well, if some inline assembly is found to be better, then perhaps it should be contributed (not necessarily as is, but as a concept) to GCC for improvement. Maciej ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-29 10:03 ` David Woodhouse 2005-07-29 14:41 ` Maciej W. Rozycki @ 2005-07-29 16:23 ` Linus Torvalds 1 sibling, 0 replies; 50+ messages in thread From: Linus Torvalds @ 2005-07-29 16:23 UTC (permalink / raw) To: David Woodhouse Cc: Steven Rostedt, Maciej W. Rozycki, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Fri, 29 Jul 2005, David Woodhouse wrote: > > On Thu, 2005-07-28 at 10:25 -0700, Linus Torvalds wrote: > > Basic rule: inline assembly is _better_ than random compiler extensions. > > It's better to have _one_ well-documented extension that is very generic > > than it is to have a thousand specialized extensions. > > Counterexample: FR-V and its __builtin_read8() et al. There are arguably always counter-examples, but your arguments really are pretty theoretical. Very seldom does compiler extensions end up being (a) timely enough and (b) semantically close enough to be really useful. > Builtins can also allow the compiler more visibility into what's going > on and more opportunity to optimise. Absolutely. In theory. In practice, not so much. All the opportunity to optimize often ends up being lost in semantic clashes, or just because people can't use the extension because it hasn't been there since day one. The fact is, inline asms are pretty rare even when we are talking about every single possible assembly combination. They are even less common when we're talking about just _one_ specific case of them (like something like __builtin_ffs()). What does this mean? It has two results: (a) instruction-level scheduling and register allocation just isn't _that_ important, and the generic "asm" register scheduling is really plenty good enough. The fact that in theory you might get better results if the compiler knew exactly what was going on is just not relevant: in practice it's simply not _true_. The other result is: (b) the compiler people don't end up seeing something like the esoteric builtins as a primary thing, so it's not like they'd be tweaking and regression-testing everything _anyway_. So I argue very strongly that __builtin_xxx() is _wrong_, unless you have very very strong reasons for it: - truly generic and _very_ important stuff: __builtin_memcpy() is actually very much worth it, since it's all over, and it's so generic that the compiler has a lot of choice in how to do it. - stuff where the architecture (or the compiler) -really- sucks with inline asms, and has serious problems, and the thing is really important. Your FR-V example _might_ fall into this category (or it might not), and ia64 has the problem with instruction packing and scheduling and so __builtin's have a bigger advantage. Basically, on most normal architectures, there's seldom any reason at _all_ to use builtins except for things like memcpy. On x86, I think the counter-example might be if you want to schedule MMX code from C - which is a special case because it doesn't follow my "rule (a)" above. But we don't do that in the kernel, really, or we just schedule it out-of-line. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 17:25 ` Linus Torvalds 2005-07-29 10:03 ` David Woodhouse @ 2005-07-29 14:39 ` Maciej W. Rozycki 2005-07-29 16:29 ` Linus Torvalds 1 sibling, 1 reply; 50+ messages in thread From: Maciej W. Rozycki @ 2005-07-29 14:39 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Linus Torvalds wrote: > There may be more upsides on other architectures (*cough*ia64*cough*) that > have strange scheduling issues and other complexities, but on x86 in > particular, the __builtin_xxx() functions tend to be a lot more pain than > they are worth. Not only do they have strange limitations (on selection of They can be buggy, sure, just like any code. If they indeed are we may want to avoid them rather than requiring a GCC upgrade, of course. > opcodes but also for compiler versions), but they aren't well documented, > and semantics aren't clear. Hmm, that's what's in the GCC info pages for the relevant functions (I've omitted the "l" and "ll" variants): "-- Built-in Function: int __builtin_ffs (unsigned int x) Returns one plus the index of the least significant 1-bit of X, or if X is zero, returns zero. -- Built-in Function: int __builtin_clz (unsigned int x) Returns the number of leading 0-bits in X, starting at the most significant bit position. If X is 0, the result is undefined. -- Built-in Function: int __builtin_ctz (unsigned int x) Returns the number of trailing 0-bits in X, starting at the least significant bit position. If X is 0, the result is undefined." If that's not enough, then what would be? I'm serious -- if you find it inadequate, then perhaps it could be improved. > In contrast, the gcc builtins probably match some standard that is not > only harder to find, but also has some _other_ definition for what happens > for the zero case, so the builtins automatically end up having problems > due to semantic mis-match between the CPU and the standard. GCC should know the semantics of underlying CPU instructions used and be able to optimize expressions for common cases, e.g. like: return x == 0 ? 32 : __builtin_ctz(x); when the CPU provides a "ctz" operation that returns 32 for 0. > Basic rule: inline assembly is _better_ than random compiler extensions. > It's better to have _one_ well-documented extension that is very generic > than it is to have a thousand specialized extensions. It depends on how many submodel-specific variants of inline assembly you need, how many reloads are required for constraints, possibly defeating the gain, etc. In this particular case "bsf" and "bsr" are notoriously slow for some i386 submodels, so using the generic O(log n) algorithm may result in better performance for them. E.g. the execution time for "bsf" for the original i386 is 11 + 3n clock ticks (n refers to the resulting bit index), "bsr" for the i486 is 7 - 104 ticks (so again 3n), for Pentium -- 7 - 72 ticks (2n, then). It does not immediately mean they are worse, but they are slow enough for the pessimistic case checking alternatives is not unreasonable. Maciej ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-29 14:39 ` Maciej W. Rozycki @ 2005-07-29 16:29 ` Linus Torvalds 2005-07-29 17:14 ` Maciej W. Rozycki 0 siblings, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2005-07-29 16:29 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Fri, 29 Jul 2005, Maciej W. Rozycki wrote: > > Hmm, that's what's in the GCC info pages for the relevant functions > (I've omitted the "l" and "ll" variants): > > "-- Built-in Function: int __builtin_ffs (unsigned int x) > Returns one plus the index of the least significant 1-bit of X, or > if X is zero, returns zero. This, for example, clashes with the x86 semantics. If X is zero, the bsfl instruction will set the ZF flag, and the result is undefined (on many, but not all, CPU's it will either be zero _or_ unmodified). We don't care, since we actually test the input for being zero separately _anyway_, but my point is that if the builtin is badly done (and I wouldn't be in the least surprised if it was), then it's going to do a totally unnecessary conditional jump of cmov. See? __builtin's can generate _worse_ code, exactly because they try to have portable semantics that may not even matter. In contrast, just doing it by hand allows us to avoid all that crap. Doing it by hand as inline assembly also allows us to do dynamic optimizations like instruction rewriting, so inline assembly is a _lot_ more powerful than builtins can reasonably ever be. > If that's not enough, then what would be? I'm serious -- if you find it > inadequate, then perhaps it could be improved. It's inadequate because IT IS POINTLESS. The builtin buys you absolutely _nothing_, and the inline asm is simpler, potentially faster, and works with every single version of gcc. USING THE BUILTIN IS A PESSIMISATION! It has absolutely _zero_ upsides, and I've named three _major_ downsides. It has another downside too: it's extra complexity and potential for bugs in the compiler. And if you tell me gcc people never have bugs, I will laugh in your general direction. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-29 16:29 ` Linus Torvalds @ 2005-07-29 17:14 ` Maciej W. Rozycki 0 siblings, 0 replies; 50+ messages in thread From: Maciej W. Rozycki @ 2005-07-29 17:14 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Fri, 29 Jul 2005, Linus Torvalds wrote: > It has another downside too: it's extra complexity and potential for bugs > in the compiler. And if you tell me gcc people never have bugs, I will > laugh in your general direction. You mean these that have been sitting in their Bugzilla for some three years with no resolution and only occasional scratching of heads? ;-) Maciej ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 16:34 ` Maciej W. Rozycki 2005-07-28 16:57 ` Steven Rostedt @ 2005-07-28 17:17 ` Linus Torvalds 2005-07-29 15:09 ` Maciej W. Rozycki 2005-07-28 18:25 ` Steven Rostedt 2 siblings, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2005-07-28 17:17 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Maciej W. Rozycki wrote: > > Since you're considering GCC-generated code for ffs(), ffz() and friends, > how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as > apropriate? Please don't. Try again in three years when everybody has them. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 17:17 ` Linus Torvalds @ 2005-07-29 15:09 ` Maciej W. Rozycki 0 siblings, 0 replies; 50+ messages in thread From: Maciej W. Rozycki @ 2005-07-29 15:09 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Linus Torvalds wrote: > > Since you're considering GCC-generated code for ffs(), ffz() and friends, > > how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as > > apropriate? > > Please don't. Try again in three years when everybody has them. Well, __builtin_ffs() has been there since at least gcc 2.95. The two others are quite recent, indeed -- apparently only since GCC 3.4. They may still be considered to be used conditionally if there is justified benefit. Maciej ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 16:34 ` Maciej W. Rozycki 2005-07-28 16:57 ` Steven Rostedt 2005-07-28 17:17 ` Linus Torvalds @ 2005-07-28 18:25 ` Steven Rostedt 2005-07-28 18:56 ` Linus Torvalds 2 siblings, 1 reply; 50+ messages in thread From: Steven Rostedt @ 2005-07-28 18:25 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Mitchell Blank Jr, Linus Torvalds, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 2005-07-28 at 17:34 +0100, Maciej W. Rozycki wrote: > On Thu, 28 Jul 2005, Steven Rostedt wrote: > > > I've been playing with different approaches, (still all hot cache > > though), and inspecting the generated code. It's not that the gcc > > generated code is always better for the normal case. But since it sees > > more and everything is not hidden in asm, it can optimise what is being > > used, and how it's used. > > Since you're considering GCC-generated code for ffs(), ffz() and friends, > how about trying __builtin_ffs(), __builtin_clz() and __builtin_ctz() as > apropriate? Reasonably recent GCC may actually be good enough to use the > fastest code depending on the processor submodel selected. > OK, I guess when I get some time, I'll start testing all the i386 bitop functions, comparing the asm with the gcc versions. Now could someone explain to me what's wrong with testing hot cache code. Can one instruction retrieve from memory better than others? I was trying to see which whas slower in CPU, but if an algorithm aligns with the cache or something that is faster, my hot cache testing will not catch that. But I don't know how to write a test that would test over and over again something that is not in cache. It would seem that I would have to find a way to flush the L1 and L2 cache each time. But it still seems to be adding too many variables to the equation to get good meaningful benchmarks. -- Steve ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 18:25 ` Steven Rostedt @ 2005-07-28 18:56 ` Linus Torvalds 0 siblings, 0 replies; 50+ messages in thread From: Linus Torvalds @ 2005-07-28 18:56 UTC (permalink / raw) To: Steven Rostedt Cc: Maciej W. Rozycki, Mitchell Blank Jr, Nick Piggin, Ingo Molnar, Andrew Morton, LKML, Daniel Walker On Thu, 28 Jul 2005, Steven Rostedt wrote: > > OK, I guess when I get some time, I'll start testing all the i386 bitop > functions, comparing the asm with the gcc versions. Now could someone > explain to me what's wrong with testing hot cache code. Can one > instruction retrieve from memory better than others? There's a few issues: - trivially: code/data size. Being smaller automatically means faster if you're cold-cache. If you do cycle tweaking of something that is possibly commonly in the L2 cache or further away, you migt as well consider one byte of code-space to be equivalent to one cycle (a L1 I$ miss can easily take 50+ cycles - the L1 fill cost may be just a small part of that, but the pipeline problem it causes can be deadly). - branch prediction: cold-cache is _different_ from hot-cache. hit-cache predicts the stuff dynamically, cold-cache has different rules (and it is _usually_ "forward predicts not-taken, backwards predicts taken", although you can add static hints if you want to on most architectures). So hot-cache may look very different indeed - the "normal" case might be that you mispredict all the time because the static prediction is wrong, but then a hot-cache benchmark will predict perfectly. - access patterns. This only matters if you look at algorithmic changes. Hashes have atrocious locality, but on the other hand, if you know that the access pattern is cold, a hash will often have a minimum number of accesses. but no, you don't have "some instructions are better at reading from memory" for regular integer code (FP often has other issues, like reading directly from L2 without polluting L1, and then there are obviously prefetch hints). Now, in the case of your "rep scas" conversion, the reason I applied it was that it was obviously a clear win (rep scas is known bad, and has register allocation issues too), so I'm _not_ claiming that the above issues were true in that case. I just wanted to say that in general it's nice (but often quite hard) if you can give cold-cache numbers too (for example, using the cycle counter and being clever can actually give that). Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work) 2005-07-28 11:43 ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt 2005-07-28 12:45 ` Steven Rostedt 2005-07-28 15:30 ` Linus Torvalds @ 2005-07-28 17:52 ` Mitchell Blank Jr 2 siblings, 0 replies; 50+ messages in thread From: Mitchell Blank Jr @ 2005-07-28 17:52 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML Steven Rostedt wrote: > In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO > configurable" I discovered that a C version of find_first_bit is faster > than the asm version There are probably other cases of this in asm-i386/bitopts.h. For instance I think the "btl" instruction is pretty slow on modern CPUs so constant_test_bit() will probably outperform variable_test_bit() even if you feed it a non-constant "nr". I'd be happy to be proven wrong, though :-) When testing these optimizations you should also probably check the resulting vmlinux size. -Mitch ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2005-07-29 17:16 UTC | newest] Thread overview: 50+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-07-27 14:13 [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Steven Rostedt 2005-07-27 14:17 ` Ingo Molnar 2005-07-27 14:26 ` Steven Rostedt 2005-07-27 14:33 ` Ingo Molnar 2005-07-27 14:47 ` [PATCH] safty check of MAX_RT_PRIO >= MAX_USER_RT_PRIO Steven Rostedt 2005-07-27 15:05 ` Steven Rostedt 2005-07-27 18:52 ` Ingo Molnar 2005-07-27 14:53 ` [RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO configurable Esben Nielsen 2005-07-27 15:02 ` Steven Rostedt 2005-07-27 16:09 ` K.R. Foley 2005-07-27 17:01 ` Esben Nielsen 2005-07-27 17:25 ` Steven Rostedt 2005-07-27 21:32 ` Esben Nielsen 2005-07-28 12:17 ` Steven Rostedt 2005-07-28 7:22 ` Ingo Molnar 2005-07-28 11:53 ` Steven Rostedt 2005-07-27 17:42 ` K.R. Foley 2005-07-28 9:59 ` Esben Nielsen 2005-07-27 14:28 ` Steven Rostedt 2005-07-27 14:38 ` Ingo Molnar 2005-07-27 14:46 ` Steven Rostedt 2005-07-28 7:33 ` Ingo Molnar 2005-07-28 1:42 ` Matt Mackall 2005-07-28 1:00 ` Daniel Walker 2005-07-28 1:20 ` Lee Revell 2005-07-28 1:26 ` Steven Rostedt 2005-07-28 1:25 ` Steven Rostedt 2005-07-28 3:06 ` Steven Rostedt 2005-07-28 3:32 ` Steven Rostedt 2005-07-28 3:45 ` Steven Rostedt 2005-07-28 3:51 ` Nick Piggin 2005-07-28 11:43 ` [PATCH] speed up on find_first_bit for i386 (let compiler do the work) Steven Rostedt 2005-07-28 12:45 ` Steven Rostedt 2005-07-28 15:31 ` Linus Torvalds 2005-07-28 15:30 ` Linus Torvalds 2005-07-28 15:47 ` Steven Rostedt 2005-07-28 16:34 ` Maciej W. Rozycki 2005-07-28 16:57 ` Steven Rostedt 2005-07-28 17:25 ` Linus Torvalds 2005-07-29 10:03 ` David Woodhouse 2005-07-29 14:41 ` Maciej W. Rozycki 2005-07-29 16:23 ` Linus Torvalds 2005-07-29 14:39 ` Maciej W. Rozycki 2005-07-29 16:29 ` Linus Torvalds 2005-07-29 17:14 ` Maciej W. Rozycki 2005-07-28 17:17 ` Linus Torvalds 2005-07-29 15:09 ` Maciej W. Rozycki 2005-07-28 18:25 ` Steven Rostedt 2005-07-28 18:56 ` Linus Torvalds 2005-07-28 17:52 ` Mitchell Blank Jr
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox