* CFS: some bad numbers with Java/database threading
@ 2007-09-12 23:10 Antoine Martin
  2007-09-13  7:18 ` David Schwartz
  2007-09-13 11:24 ` CFS: " Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread

From: Antoine Martin @ 2007-09-12 23:10 UTC (permalink / raw)
To: Linux Kernel Development

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hi list,

I was working on some unit tests and thought I'd give CFS a whirl to see
if it had any impact on my workloads (to see what the fuss was about),
and I came up with some pretty disturbing numbers:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests-noload2.png
As above, but also showing the load average:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests2.png

Looks like a regression to me...

Basically, all the previous kernels are pretty close (2.6.16 through to
2.6.20 performed almost identically to 2.6.22 and are not shown here to
avoid cluttering the graphs).
All the 2.6.23-rc kernels performed poorly (except -rc3!): much more
erratically, and with a sharp performance drop above 800 threads. The
load starts to go up and the performance takes a nosedive.
With fewer threads (less than 50) there is hardly any difference at all
between the kernels.

Notes about the tests and setup:
* environment:
  Dual Opteron 252 with 3GB RAM, SCSI disk, etc.
  Sun Java 1.6
  MySQL 5.0.44
  JUnit + ant + my test code (devloop.org.uk)
* the Java threads are created first and the data is prepared, then all
  the threads are started in a tight loop. Each thread runs multiple
  queries with a 10ms pause (to allow the other threads to get scheduled)
* the load average is divided by the number of CPUs (2)
* more general information (which also covers some irrelevant
  information about other tests I have published) is here:
  http://devloop.org.uk/documentation/database-performance/Setup/

Don't shoot the messenger!
I can run some more tests if needed (bearing in mind that a full test
run takes a few hours), or you can run the tests yourself: instructions
on running the tests are included.
I am now re-testing with sched-cfs-v2.6.23-rc6-v21-combo-2.patch, but
feel free to send other patches.

Antoine

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6HH+GK2zHPGK1rsRCl3oAJ9c4crCtNQfGs9gWO7Y5CvcIno8TACbBPTw
0TEHkqLMGAfH0ILwWVKc0oo=
=1iBA
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread
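[Editor's sketch: the harness described above — threads created first, then started in a tight loop, each running queries with a 10ms pause — could look roughly like this. This is not the actual devloop test code; the class and method names are made up, and a counter stands in for the real MySQL queries.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ManyThreadsSketch {

    // Create all threads first, then start them all in a tight loop,
    // mirroring the setup described in the post. Returns the total
    // number of (stubbed) queries executed.
    public static long runBenchmark(int nThreads, int iterations)
            throws InterruptedException {
        AtomicLong queriesRun = new AtomicLong();

        // 1) threads are created first (data preparation would go here)
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            threads.add(new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    queriesRun.incrementAndGet(); // stand-in for one JDBC query
                    try {
                        Thread.sleep(10); // 10ms pause so other threads get scheduled
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }));
        }
        // 2) start them all in a tight loop
        for (Thread t : threads)
            t.start();
        for (Thread t : threads)
            t.join();
        return queriesRun.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // the real runs scale nThreads up to 800 and beyond
        System.out.println(runBenchmark(50, 10));
    }
}
```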
* RE: some bad numbers with Java/database threading
  2007-09-12 23:10 CFS: some bad numbers with Java/database threading Antoine Martin
@ 2007-09-13  7:18 ` David Schwartz
  2007-09-12 23:33   ` Nick Piggin
  2007-09-13 11:24 ` CFS: " Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread

From: David Schwartz @ 2007-09-13  7:18 UTC (permalink / raw)
To: Linux Kernel Development

> I was working on some unit tests and thought I'd give CFS a whirl to see
> if it had any impact on my workloads (to see what the fuss was about),
> and I came up with some pretty disturbing numbers:
> http://devloop.org.uk/documentation/database-performance/Linux-Ker
> nels/Kernels-ManyThreads-CombinedTests-noload2.png
> As above but also showing the load average:
> http://devloop.org.uk/documentation/database-performance/Linux-Ker
> nels/Kernels-ManyThreads-CombinedTests2.png
> Looks like a regression to me...

I've tried reasonably diligently to figure out what the hell you're
doing and have gone through quite a bit of your documentation, and I
just can't figure it out. This could entirely be the result of your
test's sensitivity to execution order.

For example, if you run ten threads that all insert, query, and delete
from the *same* table, then the exact interleaving pattern will
determine the size of the results. A slight change in the scheduling
quantum could multiply the size of the result data by a huge factor.
There is a big difference between:

1) Thread A inserts data.
2) Thread A queries data.
3) Thread A deletes data.
4) Thread B inserts data.
...

and

1) Thread A inserts data.
2) Thread B inserts data.
...
101) Thread A queries data.
102) Thread B queries data.
...

Now, even if they're using separate tables, your test is still very
sensitive to execution order. If thread A runs to completion and then
thread B does, the database data will fit better into cache. If thread
A runs partially, then thread B runs partially, when thread A runs
again, its database stuff will not be hot.

> * java threads are created first and the data is prepared, then all the
> threads are started in a tight loop. Each thread runs multiple queries
> with a 10ms pause (to allow the other threads to get scheduled)

There are a number of ways you might be measuring nothing but how the
scheduler chooses to interleave your threads. Benchmarking threads that
yield suggests just this type of thing -- if a thread has useful work
to do and another thread is not going to help it, *why* *yield*?

Are you worried the scheduler isn't going to schedule other threads?!
Or is there some sane reason to force suboptimal scheduling when you're
trying to benchmark a scheduler? Are you trying to see how it deals
with pathological patterns? ;)

The only documentation I can see about what you're actually *doing*
says things like "The schema and statements are almost identical to the
non-threaded tests." Do you see why that's not helpful?

DS

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: some bad numbers with Java/database threading
  2007-09-13  7:18 ` David Schwartz
@ 2007-09-12 23:33   ` Nick Piggin
  2007-09-13 19:02     ` Antoine Martin
  0 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2007-09-12 23:33 UTC (permalink / raw)
To: davids, Antoine Martin; +Cc: Linux Kernel Development

On Thursday 13 September 2007 17:18, David Schwartz wrote:
> > I was working on some unit tests and thought I'd give CFS a whirl to see
> > if it had any impact on my workloads (to see what the fuss was about),
> > and I came up with some pretty disturbing numbers:
> > http://devloop.org.uk/documentation/database-performance/Linux-Ker
> > nels/Kernels-ManyThreads-CombinedTests-noload2.png
> > As above but also showing the load average:
> > http://devloop.org.uk/documentation/database-performance/Linux-Ker
> > nels/Kernels-ManyThreads-CombinedTests2.png
> > Looks like a regression to me...
>
> I've tried reasonably diligently to figure out what the hell you're doing

(cc's re-added; please reply to all when replying to lkml)

Hi David,

You might be sounding a bit too abrasive here... I understand you're
also trying to help, but your tone just might be taken the wrong way.

Antoine is really doing the right thing here: testing such a new
feature early, on the code he cares about as a user, and, most
importantly, reporting it here. This is probably the most useful
resource we have in Linux. Maybe the workload is quirky, but
regardless, if it is a *regression* from a previous kernel then it is
really important to bring it to our attention.

> and gone through quite a bit of your documentation, and I just can't figure
> it out. This could entirely be the result of your test's sensitivity to
> execution order.
>
> For example, if you run ten threads that all insert, query, and delete from
> the *same* table, then the exact interleaving pattern will determine the
> size of the results. A slight change in the scheduling quantum could
> multiply the size of the result data by a huge factor. There is a big
> difference between:
>
> 1) Thread A inserts data.
> 2) Thread A queries data.
> 3) Thread A deletes data.
> 4) Thread B inserts data.
> ...
>
> and
>
> 1) Thread A inserts data.
> 2) Thread B inserts data.
> ...
> 101) Thread A queries data.
> 102) Thread B queries data.
> ...
>
> Now, even if they're using separate tables, your test is still very
> sensitive to execution order. If thread A runs to completion and then
> thread B does, the database data will fit better into cache. If thread A
> runs partially, then thread B runs partially, when thread A runs again, its
> database stuff will not be hot.
>
> > * java threads are created first and the data is prepared, then all the
> > threads are started in a tight loop. Each thread runs multiple queries
> > with a 10ms pause (to allow the other threads to get scheduled)
>
> There are a number of ways you might be measuring nothing but how the
> scheduler chooses to interleave your threads. Benchmarking threads that
> yield suggests just this type of thing -- if a thread has useful work to do
> and another thread is not going to help it, *why* *yield*?
>
> Are you worried the scheduler isn't going to schedule other threads?! Or is
> there some sane reason to force suboptimal scheduling when you're trying to
> benchmark a scheduler? Are you trying to see how it deals with pathological
> patterns? ;)
>
> The only documentation I can see about what you're actually *doing* says
> things like "The schema and statements are almost identical to the
> non-threaded tests." Do you see why that's not helpful?

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: some bad numbers with Java/database threading
  2007-09-12 23:33 ` Nick Piggin
@ 2007-09-13 19:02   ` Antoine Martin
  2007-09-13 21:47     ` David Schwartz
  0 siblings, 1 reply; 32+ messages in thread

From: Antoine Martin @ 2007-09-13 19:02 UTC (permalink / raw)
To: Nick Piggin; +Cc: davids, Linux Kernel Development

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

I've tested a couple more kernels: 2.6.23-rc4-mm1, and 2.6.23-rc6 with
the "sched_yield_bug_workaround" patch from Ingo. Results are here:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield.png
Same, with the load average (dotted lines):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield-withload.png
It still slows down above 800 threads...

> On Thursday 13 September 2007 17:18, David Schwartz wrote:
>>> I was working on some unit tests and thought I'd give CFS a whirl to see
>>> if it had any impact on my workloads (to see what the fuss was about),
>>> and I came up with some pretty disturbing numbers:
>>> http://devloop.org.uk/documentation/database-performance/Linux-Ker
>>> nels/Kernels-ManyThreads-CombinedTests-noload2.png
>>> As above but also showing the load average:
>>> http://devloop.org.uk/documentation/database-performance/Linux-Ker
>>> nels/Kernels-ManyThreads-CombinedTests2.png
>>> Looks like a regression to me...
>> I've tried reasonably diligently to figure out what the hell you're doing
>
> (cc's re-added; please reply to all when replying to lkml)
(...)
>> and gone through quite a bit of your documentation, and I just can't figure
>> it out.
Until I find enough time to update the docs, feel free to ask.

>> This could entirely be the result of your test's sensitivity to
>> execution order.
It very well might be, but this is still a visible regression.

>> For example, if you run ten threads that all insert, query, and delete from
>> the *same* table, then the exact interleaving pattern will determine the
>> size of the results. A slight change in the scheduling quantum could
>> multiply the size of the result data by a huge factor. There is a big
>> difference between:
>>
>> 1) Thread A inserts data.
>> 2) Thread A queries data.
>> 3) Thread A deletes data.
>> 4) Thread B inserts data.
>> ...
>>
>> and
>>
>> 1) Thread A inserts data.
>> 2) Thread B inserts data.
>> ...
>> 101) Thread A queries data.
>> 102) Thread B queries data.
Thanks for the suggestion, I'll add this to the test series.

>> Now, even if they're using separate tables, your test is still very
>> sensitive to execution order. If thread A runs to completion and then
>> thread B does, the database data will fit better into cache. If thread A
>> runs partially, then thread B runs partially, when thread A runs again, its
>> database stuff will not be hot.
You are correct: it is very sensitive to the execution order and to
caching. When I vary the thread pause, the total execution time varies
widely. 10ms just happens to be the sweet spot which provides the best
contrast in the results (for both kernels and RDBMS).

>>> * java threads are created first and the data is prepared, then all the
>>> threads are started in a tight loop. Each thread runs multiple queries
>>> with a 10ms pause (to allow the other threads to get scheduled)
>> There are a number of ways you might be measuring nothing but how the
>> scheduler chooses to interleave your threads. Benchmarking threads that
>> yield suggests just this type of thing -- if a thread has useful work to do
>> and another thread is not going to help it, *why* *yield*?
As I said above, I have tried varying the pause and this is the sweet
spot. If you don't yield, the first hundred threads will be finished by
the time you start the 800th - which is not what you want.
I could try yielding just at the start, or only during the first
iteration, etc. All suggestions are most welcome.

>> Are you worried the scheduler isn't going to schedule other threads?!
Yes!

>> Or is there some sane reason to force suboptimal scheduling when
>> you're trying to benchmark a scheduler? Are you trying to see how it
>> deals with pathological patterns? ;)
I know it sounds sub-optimal, but this benchmark wasn't designed to
test the scheduler! It is meant to thrash just one database table. It
isn't meant to be anything like TPC either.
Previously it did uncover some very noticeable differences in JVM
performance when stress-tested with a large number of threads.

>> The only documentation I can see about what you're actually *doing* says
>> things like "The schema and statements are almost identical to the
>> non-threaded tests." Do you see why that's not helpful?
I sure do! I'll try to improve on that when I get a chance. Here is a
start: the schema is configurable with simple text files, and the
database layer translates it into the SQL syntax (and types) supported
by the backend (MySQL in this case):

# cat ManyThreadedCombinedTest.properties
table=update
columns=i1:integer,i2:integer,s1:varchar,s2:varchar

I'll make a click&run tarball if anyone is interested in running the
tests themselves (it outputs csv data).

Thanks for the feedback.

Antoine

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6Yk8GK2zHPGK1rsRCtVaAKCAVyU4woztnl6q0NAZBNsM94I2JgCcD4M3
+MpR1gAom65Mn/tQX8ig1cI=
=Q3IY
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread
* RE: some bad numbers with Java/database threading
  2007-09-13 19:02 ` Antoine Martin
@ 2007-09-13 21:47   ` David Schwartz
  0 siblings, 0 replies; 32+ messages in thread

From: David Schwartz @ 2007-09-13 21:47 UTC (permalink / raw)
To: Linux-Kernel@Vger. Kernel. Org

First, let me apologize if the tone of my other post came through as
angry or frustrated. Text can sometimes be misleading as to tone. I
certainly wasn't angry.

Second, let me say that I'm definitely not suggesting that you were
wrong to bring this to everyone's attention. Even if it turns out that
your code is horribly broken and it's all your fault, any apparent
regression still has to be investigated. It is virtually certain that
we will learn something interesting about your code, CFS, or both. I
was definitely *not* saying "how dare you challenge CFS' supremacy
without a perfect test case".

Antoine Martin wrote:

> >> Now, even if they're using separate tables, your test is still very
> >> sensitive to execution order. If thread A runs to completion and then
> >> thread B does, the database data will fit better into cache.
> >> If thread A runs partially, then thread B runs partially, when
> >> thread A runs again, its database stuff will not be hot.

> You are correct, it is very sensitive to the execution order and
> caching. When I vary the thread pause, the total execution time varies
> widely. 10ms just happens to be the sweet spot which provides the best
> contrast in the results (for both kernels and rdbms)

> >> Or is there some sane reason to force suboptimal scheduling when
> >> you're trying to benchmark a scheduler? Are you trying to see how it
> >> deals with pathological patterns? ;)

> I know it sounds sub-optimal, but this benchmark wasn't designed to test
> the scheduler! It is meant to thrash just one database table. It isn't
> meant to be anything like TPC either.
> Previously it did uncover some very noticeable differences in JVM
> performance when stress-tested with a large number of threads.

The problem is that because your access pattern is pathological,
schedulers that are objectively worse may turn in much better
performance. For example, a scheduler that completely ignores your
attempt to sleep will probably perform significantly better than a
scheduler that goes out of its way to honor it. That the execution time
is very sensitive to the pause is strong evidence of this.

The problem is simply that your test program doesn't do a fixed amount
of work. It does a variable amount of work, and that amount depends
upon scheduler details. It's like a job that has fifty threads that do
this:

1) Increment a shared variable.
2) Do a math problem a number of times equal to the value of that
shared variable.
3) Decrement the shared variable.

The problem is that the number of times the math problem is done
depends upon the execution order of the threads. To be fair, you would
need to benchmark how many times the math problem gets done, not how
long the threads take to complete.

Now, imagine if I insert a yield between steps 1 and 2. The more a
scheduler honors your yield request, the worse it will appear to
perform. The scheduler that ignores it (which is legal, but definitely
not The Right Thing) will seem to perform *much* better.

Of course, it's also possible that this is not what's going on.

DS

^ permalink raw reply	[flat|nested] 32+ messages in thread
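[Editor's sketch: the fifty-thread thought experiment above — work proportional to a shared counter, so the total amount of work depends entirely on how the scheduler interleaves the threads — can be written out concretely. This is an illustration of David's example, not anyone's real benchmark; names are made up.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class VariableWorkDemo {

    // Each thread repeatedly: 1) increments a shared variable,
    // 2) does a "math problem" as many times as its current value,
    // 3) decrements it. Total work done is NOT fixed: serial execution
    // gives nThreads*rounds units, full overlap up to nThreads times more.
    public static long run(int nThreads, int rounds) throws InterruptedException {
        AtomicLong depth = new AtomicLong();          // the shared variable
        AtomicLong problemsSolved = new AtomicLong(); // how much work actually happened

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            threads.add(new Thread(() -> {
                for (int r = 0; r < rounds; r++) {
                    long d = depth.incrementAndGet(); // 1) increment shared variable
                    Thread.yield();                   // the scheduler may or may not honor this
                    for (long k = 0; k < d; k++)      // 2) "math problem", d times
                        problemsSolved.incrementAndGet();
                    depth.decrementAndGet();          // 3) decrement shared variable
                }
            }));
        }
        for (Thread t : threads)
            t.start();
        for (Thread t : threads)
            t.join();
        return problemsSolved.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // anywhere between 4*1000 (serial) and 4*1000*4 (fully overlapped):
        System.out.println(run(4, 1000));
    }
}
```

Timing the threads therefore benchmarks the interleaving, not the scheduler's quality; counting `problemsSolved` would be the fair metric.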
* Re: CFS: some bad numbers with Java/database threading
  2007-09-12 23:10 CFS: some bad numbers with Java/database threading Antoine Martin
  2007-09-13  7:18 ` David Schwartz
@ 2007-09-13 11:24 ` Ingo Molnar
  2007-09-14  8:32   ` Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2007-09-13 11:24 UTC (permalink / raw)
To: Antoine Martin; +Cc: Linux Kernel Development

* Antoine Martin <antoine@nagafix.co.uk> wrote:

> Basically, all the previous kernels are pretty close (2.6.16 through
> to 2.6.20 performed almost identically to 2.6.22 and are not shown
> here to avoid cluttering the graphs)
>
> All the 2.6.23-rc kernels performed poorly (except -rc3!): much more
> erratically and with a sharp performance drop above 800 threads. The
> load starts to go up and the performance takes a nosedive.

hm, could you try the patch below on top of 2.6.23-rc6 and do:

  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

does this improve the numbers?

	Ingo

-------------->
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.
Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1392,10 +1392,12 @@ extern void sched_idle_next(void);
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 #endif
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ const_debug unsigned int sysctl_sched_la
  */
 const_debug unsigned int sysctl_sched_child_runs_first = 1;
 
+const_debug unsigned int sysctl_sched_yield_granularity = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+const_debug unsigned int sysctl_sched_yield_bug_workaround;
+
 /*
  * Minimal preemption granularity for CPU-bound tasks:
  * (default: 2 msec, units: nanoseconds)
@@ -675,15 +685,66 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	if (!sysctl_sched_yield_bug_workaround) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		__update_rq_clock(rq);
+
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0);
+		enqueue_entity(cfs_rq, &p->se, 0);
+		return;
+	}
+
+	if (sysctl_sched_yield_bug_workaround == 1) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		struct rb_node *curr, *next, *first;
+		struct task_struct *p_next;
+		s64 yield_key;
+
+		__update_rq_clock(rq);
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		return;
+	}
 
-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Just reschedule, do nothing else:
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	resched_task(p);
 }
 
 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -266,6 +277,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading
  2007-09-13 11:24 ` CFS: " Ingo Molnar
@ 2007-09-14  8:32 ` Ingo Molnar
  2007-09-14 10:06   ` Satyam Sharma
  0 siblings, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2007-09-14  8:32 UTC (permalink / raw)
To: Antoine Martin; +Cc: Linux Kernel Development, Peter Zijlstra

* Ingo Molnar <mingo@elte.hu> wrote:

> hm, could you try the patch below on top of 2.6.23-rc6 and do:
>
>   echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
>
> does this improve the numbers?

the patch i sent was against CFS-devel. Could you try the one below,
which is against vanilla -rc6 - does it improve the numbers? (it should
have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
sysctl.

	Ingo

------------------------>
Subject: sched: yield debugging
From: Ingo Molnar <mingo@elte.hu>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.
Not-yet-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);
 
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
  */
 unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;
 
+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
 /*
  * SCHED_BATCH wake-up granularity.
  * (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	if (!sysctl_sched_yield_bug_workaround) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		__update_rq_clock(rq);
+
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0);
+		enqueue_entity(cfs_rq, &p->se, 0);
+		return;
+	}
+
+	if (sysctl_sched_yield_bug_workaround == 1) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		struct rb_node *curr, *next, *first;
+		struct task_struct *p_next;
+		s64 yield_key;
+
+		__update_rq_clock(rq);
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (unlikely(curr != first)) {
+			next = first;
+		} else {
+			next = rb_next(curr);
+			/*
+			 * We were the last one already - nothing to do, return
+			 * and reschedule:
+			 */
+			if (unlikely(!next))
+				return;
+		}
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		return;
+	}
 
-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Just reschedule, do nothing else:
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	resched_task(p);
 }
 
 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -288,6 +299,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading
  2007-09-14  8:32 ` Ingo Molnar
@ 2007-09-14 10:06 ` Satyam Sharma
  2007-09-14 15:25   ` CFS: some bad numbers with Java/database threading [FIXED] Antoine Martin
  2007-09-14 16:01   ` CFS: some bad numbers with Java/database threading Satyam Sharma
  0 siblings, 2 replies; 32+ messages in thread

From: Satyam Sharma @ 2007-09-14 10:06 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Antoine Martin, Linux Kernel Development, Peter Zijlstra

Hi Antoine, Ingo,

On 9/14/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > hm, could you try the patch below on top of 2.6.23-rc6 and do:
> >
> >   echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> >
> > does this improve the numbers?

Hmm, I know diddly about Java, and I don't want to preempt Antoine's
next test, but I noticed that he uses Thread.sleep() in his test code
and not Thread.yield(), so it would be interesting if Antoine could
test with this patch and report whether something shows up ...

> the patch i sent was against CFS-devel. Could you try the one below,
> which is against vanilla -rc6, does it improve the numbers? (it should
> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
> sysctl.

On 9/13/07, Antoine Martin <antoine@nagafix.co.uk> wrote:
>
> All the 2.6.23-rc kernels performed poorly (except -rc3!):

This is an interesting data point, IMHO ... considering these tests are
long, I suspect you ran them only once each per kernel, so I wonder how
reliable that -rc3 test point is. If this oddity is reproducible, it
would be great if you could git-bisect:

1. between 23-rc1 and 23-rc3, to find out which commit led to the
   improvement in performance, and
2. between 23-rc3 and 23-rc6, to find out which commit brought the
   numbers down again.

[ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html,
  git-bisect is easy and amazingly helpful on certain occasions. ]

> Notes about the tests and setup:
> * environment is:
> Dual Opteron 252 with 3GB ram, scsi disk, etc..
> Sun Java 1.6
> MySQL 5.0.44
> Junit + ant + my test code (devloop.org.uk)
> * java threads are created first and the data is prepared, then all the
> threads are started in a tight loop. Each thread runs multiple queries
> with a 10ms pause (to allow the other threads to get scheduled)

Don't know much about CFS either, but does that constant "10 ms" sleep
somehow lead to evil synchronization issues between the test threads?
Does randomizing that time (say from 2-20 ms) lead to different numbers?

> * load average is divided by the number of cpus (2)
> * more general information (which also covers some irrelevant
> information about some other tests I have published) is here:
> http://devloop.org.uk/documentation/database-performance/Setup/

Thanks,
Satyam

^ permalink raw reply	[flat|nested] 32+ messages in thread
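[Editor's sketch: Satyam's randomization suggestion — replacing the fixed 10ms sleep with a uniformly random 2-20ms one so the threads cannot fall into accidental lock-step — is a one-line change in the worker loop. Names here are illustrative, not from the actual test code.]

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomizedPause {

    // Uniform in [2, 20] ms instead of a constant 10 ms.
    public static long randomPauseMillis() {
        return ThreadLocalRandom.current().nextLong(2, 21);
    }

    // Drop-in replacement for the fixed Thread.sleep(10) in each worker.
    public static void pause() throws InterruptedException {
        Thread.sleep(randomPauseMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        pause();
        System.out.println("paused ~" + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}
```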
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-14 10:06 ` Satyam Sharma @ 2007-09-14 15:25 ` Antoine Martin 2007-09-14 15:32 ` Ingo Molnar 2007-09-14 16:01 ` CFS: some bad numbers with Java/database threading Satyam Sharma 1 sibling, 1 reply; 32+ messages in thread From: Antoine Martin @ 2007-09-14 15:25 UTC (permalink / raw) To: Satyam Sharma; +Cc: Ingo Molnar, Linux Kernel Development, Peter Zijlstra -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Satyam Sharma wrote: > Hi Antoine, Ingo, > On 9/14/07, Ingo Molnar <mingo@elte.hu> wrote: >> * Ingo Molnar <mingo@elte.hu> wrote: >> >>> hm, could you try the patch below ontop of 2.6.23-rc6 and do: >>> >>> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround >>> >>> does this improve the numbers? > > Hmm, I know diddly about Java, and I don't want to preempt Antoine's next > test, but I noticed that he uses Thread.sleep() in his testcode and not the > Thread.yield() so it would be interesting if Antoine can test with this patch > and report if something shows up .. See below... I'll add a new test using yield() and see what that does. >> the patch i sent was against CFS-devel. Could you try the one below, >> which is against vanilla -rc6, does it improve the numbers? (it should >> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the >> sysctl. It looks good now! Updated results here: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png Compared with more kernels here - a bit more cluttered: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png Thanks Ingo! Does this mean that I'll have to keep doing: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround Or are you planning on finding a more elegant solution? 
# find /proc -name "*workaround*"
/proc/sys/kernel/sched_yield_bug_workaround
/proc/sys/net/ipv4/tcp_workaround_signed_windows

> On 9/13/07, Antoine Martin <antoine@nagafix.co.uk> wrote:
>> All the 2.6.23-rc kernels performed poorly (except -rc3!):
>
> This is an interesting data point, IMHO ... considering these tests are long,
> I suspect you ran them only once each per kernel. So I wonder how reliable
> that -rc3 testpoint is. If this oddity is reproducible, it would be great if you
> could git-bisect:

Yeah, I thought that was quite suspicious. -rc2 is just like -rc1 (see
above) so I'll re-test -rc3 first. git-bisect could take a while with
those tests... just wiping the disk between tests takes about 30 mins.

>> * java threads are created first and the data is prepared, then all the
>> threads are started in a tight loop. Each thread runs multiple queries
>> with a 10ms pause (to allow the other threads to get scheduled)
>
> Don't know much about CFS either, but does that constant "10 ms" sleep
> somehow lead to evil synchronization issues between the test threads?
> Does randomizing that time (say from 2-20 ms) lead to different numbers?

I've tested before with varying timings, but I had not thought of using
a randomized delay. Will add that too.

Many thanks to you all for the feedback!

Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6qfkGK2zHPGK1rsRCgeEAJ9HUUtHUNScvTVKo5z2sSmo+G+BVgCfdYmK
rcd1VYUuzQA2oFEmakjZxgM=
=jmI8
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-14 15:25 ` CFS: some bad numbers with Java/database threading [FIXED] Antoine Martin @ 2007-09-14 15:32 ` Ingo Molnar 2007-09-18 17:00 ` Chuck Ebbert 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2007-09-14 15:32 UTC (permalink / raw) To: Antoine Martin; +Cc: Satyam Sharma, Linux Kernel Development, Peter Zijlstra * Antoine Martin <antoine@nagafix.co.uk> wrote: > >> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the > >> sysctl. > It looks good now! Updated results here: > http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png > http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png > Compared with more kernels here - a bit more cluttered: > http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png > > Thanks Ingo! > Does this mean that I'll have to keep doing: > echo 1 > /proc/sys/kernel/sched_yield_bug_workaround > Or are you planning on finding a more elegant solution? just to make sure - can you get it to work fast with the -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested) solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've attached below the latest version of the -rc6 yield patch - the switch is not dependent on SCHED_DEBUG anymore but always available. Ingo -------------------> Subject: sched: yield workaround From: Ingo Molnar <mingo@elte.hu> sched_yield() is fundamentally broken, and CFS has changed its behavior. Some apps that mistakenly rely on sched_yield() want "weak" sched yield (such as desktop apps) - while some apps want "strong" sched_yield() (for example some JDKs). 
There's no way for the scheduler to figure out which of the two variants
the app really wants - because sched_yield() is all about hiding from
the kernel the true structure of the user-space locking code.

As a solution, provide a workaround to introduce a more aggressive
sched_yield implementation:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

in the future, the use of this sysctl might generate a deprecation
warning, so that apps start moving away from their reliance on
sched_yield().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
  */
 unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
 /*
  * SCHED_BATCH wake-up granularity.
  * (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	if (!sysctl_sched_yield_bug_workaround) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		__update_rq_clock(rq);
+
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0);
+		enqueue_entity(cfs_rq, &p->se, 0);
+		return;
+	}
+
+	if (sysctl_sched_yield_bug_workaround == 1) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		struct rb_node *curr, *next, *first;
+		struct task_struct *p_next;
+		s64 yield_key;
+
+		__update_rq_clock(rq);
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (curr != first)
+			next = rb_next(curr);
+		else
+			next = first;
+		/*
+		 * We were the last one already - nothing to do, return
+		 * and reschedule:
+		 */
+		if (unlikely(!next))
+			return;
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key +
+				(int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		return;
+	}

-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Just reschedule, do nothing else:
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	resched_task(p);
 }

 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -288,6 +299,14 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-14 15:32 ` Ingo Molnar @ 2007-09-18 17:00 ` Chuck Ebbert 2007-09-18 22:46 ` Ingo Molnar 0 siblings, 1 reply; 32+ messages in thread From: Chuck Ebbert @ 2007-09-18 17:00 UTC (permalink / raw) To: Ingo Molnar Cc: Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra On 09/14/2007 11:32 AM, Ingo Molnar wrote: > * Antoine Martin <antoine@nagafix.co.uk> wrote: > >>>> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the >>>> sysctl. >> It looks good now! Updated results here: >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png >> Compared with more kernels here - a bit more cluttered: >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png >> >> Thanks Ingo! >> Does this mean that I'll have to keep doing: >> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround >> Or are you planning on finding a more elegant solution? > > just to make sure - can you get it to work fast with the > -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested) > solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've > attached below the latest version of the -rc6 yield patch - the switch > is not dependent on SCHED_DEBUG anymore but always available. > Is this going to be merged? And will you be making the default == 1 or just leaving it at 0, which forces people who want the older behavior to modify the default? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-18 17:00 ` Chuck Ebbert @ 2007-09-18 22:46 ` Ingo Molnar 2007-09-18 23:02 ` Chuck Ebbert 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2007-09-18 22:46 UTC (permalink / raw) To: Chuck Ebbert Cc: Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra * Chuck Ebbert <cebbert@redhat.com> wrote: > On 09/14/2007 11:32 AM, Ingo Molnar wrote: > > * Antoine Martin <antoine@nagafix.co.uk> wrote: > > > >>>> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the > >>>> sysctl. > >> It looks good now! Updated results here: > >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png > >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png > >> Compared with more kernels here - a bit more cluttered: > >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png > >> > >> Thanks Ingo! > >> Does this mean that I'll have to keep doing: > >> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround > >> Or are you planning on finding a more elegant solution? > > > > just to make sure - can you get it to work fast with the > > -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested) > > solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've > > attached below the latest version of the -rc6 yield patch - the switch > > is not dependent on SCHED_DEBUG anymore but always available. > > > > Is this going to be merged? And will you be making the default == 1 or > just leaving it at 0, which forces people who want the older behavior > to modify the default? not at the moment - Antoine suggested that the workload is probably fine and the patch against -rc6 would have no clear effect anyway so we have nothing to merge right now. 
(Note that there's no "older behavior" possible, unless we want to emulate all of the O(1) scheduler's behavior.) But ... we could still merge something like that patch, but a clearer testcase is needed. The JVMs I have access to work fine.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-18 22:46 ` Ingo Molnar @ 2007-09-18 23:02 ` Chuck Ebbert 2007-09-19 18:45 ` David Schwartz 2007-09-19 19:18 ` Ingo Molnar 0 siblings, 2 replies; 32+ messages in thread From: Chuck Ebbert @ 2007-09-18 23:02 UTC (permalink / raw) To: Ingo Molnar Cc: Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra On 09/18/2007 06:46 PM, Ingo Molnar wrote: >>> We need a (tested) >>> solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've >>> attached below the latest version of the -rc6 yield patch - the switch >>> is not dependent on SCHED_DEBUG anymore but always available. >>> >> Is this going to be merged? And will you be making the default == 1 or >> just leaving it at 0, which forces people who want the older behavior >> to modify the default? > > not at the moment - Antoine suggested that the workload is probably fine > and the patch against -rc6 would have no clear effect anyway so we have > nothing to merge right now. (Note that there's no "older behavior" > possible, unless we want to emulate all of the O(1) scheduler's > behavior.) But ... we could still merge something like that patch, but a > clearer testcase is needed. The JVM's i have access to work fine. I just got a bug report today: https://bugzilla.redhat.com/show_bug.cgi?id=295071 ================================================== Description of problem: The CFS scheduler does not seem to implement sched_yield correctly. If one program loops with a sched_yield and another program prints out timing information in a loop. You will see that if both are taskset to the same core that the timing stats will be twice as long as when they are on different cores. This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues in the newest released kernel 2.6.22.5-76. 
Version-Release number of selected component (if applicable):
2.6.22.4-65 through 2.6.22.5-76

How reproducible:
Very

Steps to Reproduce:
compile task1

int main()
{
	while (1) {
		sched_yield();
	}
	return 0;
}

and compile task2

#include <stdio.h>
#include <sys/time.h>

int main()
{
	while (1) {
		int i;
		struct timeval t0, t1;
		double usec;

		gettimeofday(&t0, 0);
		for (i = 0; i < 100000000; ++i)
			;
		gettimeofday(&t1, 0);
		usec = (t1.tv_sec * 1e6 + t1.tv_usec) -
		       (t0.tv_sec * 1e6 + t0.tv_usec);
		printf("%8.0f\n", usec);
	}
	return 0;
}

Then run:
"taskset -c 0 ./task1"
"taskset -c 0 ./task2"

You will see that both tasks use 50% of the CPU.

Then kill task2 and run:
"taskset -c 1 ./task2"

Now task2 will run twice as fast, verifying that it is not some anomaly
with the way top calculates CPU usage with sched_yield.

Actual results:
Tasks with sched_yield do not yield like they are supposed to.

Expected results:
The sched_yield task's CPU usage should go to near 0% when another task
is on the same CPU.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* RE: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-18 23:02 ` Chuck Ebbert @ 2007-09-19 18:45 ` David Schwartz 2007-09-19 19:48 ` Chris Friesen 2007-09-19 19:18 ` Ingo Molnar 1 sibling, 1 reply; 32+ messages in thread From: David Schwartz @ 2007-09-19 18:45 UTC (permalink / raw) To: Linux-Kernel@Vger. Kernel. Org > The CFS scheduler does not seem to implement sched_yield correctly. If one > program loops with a sched_yield and another program prints out timing > information in a loop. You will see that if both are taskset to > the same core > that the timing stats will be twice as long as when they are on > different cores. > This problem was not in 2.6.21-1.3194 but showed up in > 2.6.22.4-65 and continues > in the newest released kernel 2.6.22.5-76. I disagree with the bug report. > You will see that both tasks use 50% of the CPU. > Then kill task2 and run: > "taskset -c 1 ./task2" This seems right. They're both always ready to run. They're at the same priority. Neither ever blocks. There is no reason one should get more CPU than the other. > Now task2 will run twice as fast verifying that it is not some > anomaly with the > way top calculates CPU usage with sched_yield. > > Actual results: > Tasks with sched_yield do not yield like they are suppose to. Umm, how does he get that? It's yielding at blinding speed. > Expected results: > The sched_yield task's CPU usage should go to near 0% when > another task is on > the same CPU. Nonsense. The task is always ready-to-run. There is no reason its CPU should be low. This bug report is based on a misunderstanding of what yielding means. The Linux page says: "A process can relinquish the processor voluntarily without blocking by calling sched_yield(). The process will then be moved to the end of the queue for its static priority and a new process gets to run." Notice the "without blocking" part? 
POSIX says: "The sched_yield() function forces the running thread to relinquish the processor until it again becomes the head of its thread list. It takes no arguments." CFS is perfectly complying with both of these. This bug report is a great example of how sched_yield can be misunderstood and misused. You can even argue that the sched_yield process should get even more CPU, since it's voluntarily relinquishing (which should be rewarded) rather than infinitely spinning (which should be punished). (Not that I agree with this argument, I'm just using it to counter-balance the other argument.) DS ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-19 18:45 ` David Schwartz @ 2007-09-19 19:48 ` Chris Friesen 2007-09-19 22:56 ` David Schwartz 0 siblings, 1 reply; 32+ messages in thread From: Chris Friesen @ 2007-09-19 19:48 UTC (permalink / raw) To: davids; +Cc: Linux-Kernel@Vger. Kernel. Org David Schwartz wrote: > Nonsense. The task is always ready-to-run. There is no reason its CPU should > be low. This bug report is based on a misunderstanding of what yielding > means. The yielding task has given up the cpu. The other task should get to run for a timeslice (or whatever the equivalent is in CFS) until the yielding task again "becomes head of the thread list". Chris ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-19 19:48 ` Chris Friesen @ 2007-09-19 22:56 ` David Schwartz 2007-09-19 23:05 ` David Schwartz 0 siblings, 1 reply; 32+ messages in thread From: David Schwartz @ 2007-09-19 22:56 UTC (permalink / raw) To: CFRIESEN; +Cc: Linux-Kernel@Vger. Kernel. Org > David Schwartz wrote: > > Nonsense. The task is always ready-to-run. There is no reason > > its CPU should > > be low. This bug report is based on a misunderstanding of what yielding > > means. > The yielding task has given up the cpu. The other task should get to > run for a timeslice (or whatever the equivalent is in CFS) until the > yielding task again "becomes head of the thread list". Are you sure this isn't happening? I'll run some tests on my SMP system running CFS. But I'll bet the context switch rate is quite rapid. Honestly, I can't imagine what else could be happening here. Does CFS spin in a loop doing nothing when you call sched_yield even though another task is ready-to-run? That seems kind of bizarre. Is sched_yield acting as a no-op? DS ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-19 22:56 ` David Schwartz @ 2007-09-19 23:05 ` David Schwartz 2007-09-19 23:52 ` David Schwartz 0 siblings, 1 reply; 32+ messages in thread From: David Schwartz @ 2007-09-19 23:05 UTC (permalink / raw) To: CFRIESEN; +Cc: Linux-Kernel@Vger. Kernel. Org Chris Friesen wrote: > > The yielding task has given up the cpu. The other task should get to > > run for a timeslice (or whatever the equivalent is in CFS) until the > > yielding task again "becomes head of the thread list". > Are you sure this isn't happening? I'll run some tests on my SMP > system running CFS. But I'll bet the context switch rate is quite rapid. Yep, that's exactly what's happening. The tasks are alternating. They are both always ready-to-run. The yielding task is put at the end of the queue for its priority level. There is no reason the yielding task should get less CPU since they're both always ready-to-run. The only downside here is that a yielding task results in very small timeslices which causes cache inefficiencies. A sane lower bound on the timeslice might be a good idea. But there is no semantic problem. DS ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-19 23:05 ` David Schwartz @ 2007-09-19 23:52 ` David Schwartz 0 siblings, 0 replies; 32+ messages in thread From: David Schwartz @ 2007-09-19 23:52 UTC (permalink / raw) To: CFRIESEN; +Cc: Linux-Kernel@Vger. Kernel. Org Ack, sorry, I'm wrong. Please ignore me, if you weren't already. I'm glad to hear this will be fixed. The task should be moved last for its priority level. DS ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-18 23:02 ` Chuck Ebbert 2007-09-19 18:45 ` David Schwartz @ 2007-09-19 19:18 ` Ingo Molnar 2007-09-19 19:39 ` Linus Torvalds 2007-09-19 20:00 ` CFS: some bad numbers with Java/database threading [FIXED] Chris Friesen 1 sibling, 2 replies; 32+ messages in thread From: Ingo Molnar @ 2007-09-19 19:18 UTC (permalink / raw) To: Chuck Ebbert Cc: Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra, Linus Torvalds * Chuck Ebbert <cebbert@redhat.com> wrote: > I just got a bug report today: > > https://bugzilla.redhat.com/show_bug.cgi?id=295071 > > ================================================== > > Description of problem: > > The CFS scheduler does not seem to implement sched_yield correctly. If > one program loops with a sched_yield and another program prints out > timing information in a loop. You will see that if both are taskset to > the same core that the timing stats will be twice as long as when they > are on different cores. This problem was not in 2.6.21-1.3194 but > showed up in 2.6.22.4-65 and continues in the newest released kernel > 2.6.22.5-76. sched_yield is a very poorly defined interface as it does not tell the kernel anything about what kind of locking user-space does. When an app calls the syscall it basically tells the scheduler: "uhm, ok, i'm blocked and i want you to do something else now. I dont want to tell you how long this task wants to wait, and i dont want to tell you on what condition it should be unblocked. Just ... do some stuff and let me alone. See ya." That's not an intelligent interface upon which the scheduler can do anything particularly intelligent (in fact, it's a very stupid interface upon which the scheduler can only do stupid things), and hence schedulers tend to implement sched_yield() quite arbitrarily and in a very scheduler-internals dependent way - which internals change when the scheduler is changed materially. 
The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these. (The only well-defined behavior of yield is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.)

Regarding compatible behavior: it's not possible to fully emulate the O(1) scheduler's yield behavior, because the yield implementation depended on so many scheduler internals. We changed yield when we went from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To avoid the recurring problems of applications mistakenly relying on sched_yield(), we now context-switch on yield very weakly for SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the only sane behavior that will get apps to stop using sched_yield() - and that's the difference that the above testcase shows. (there's actual user-space code executed, instead of the frequent overscheduling.)

My patch below adds a sysctl flag that triggers a context-switch when yield is called (however futile, undefined and broken that might be), but that would only be for legacy reasons. We could still add this patch, but i was reluctant to send it to Linus without having at least one application that relies on having it and that benefits from it (Antoine's test turned out to not rely on it - see the '[FIXED]' qualifier in the subject line).

Linus, what do you think? I have no strong feelings, I think the patch cannot hurt (it does not change anything by default) - but we should not turn the workaround flag on by default. If you agree that we should do this, then please pull this single patch from the sched.git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

i've tested it with the flag off and on as well, and it works as expected.
	Ingo

------------------>
Ingo Molnar (1):
      sched: yield workaround

 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 88 insertions(+), 6 deletions(-)

	Ingo

------------------------------>
Subject: sched: yield workaround
From: Ingo Molnar <mingo@elte.hu>

sched_yield() is fundamentally broken, and CFS has changed its behavior.
Some apps that mistakenly rely on sched_yield() want "weak" sched yield
(such as desktop apps) - while some apps want "strong" sched_yield()
(for example some JDKs).

There's no way for the scheduler to figure out which of the two variants
the app really wants - because sched_yield() is all about hiding from
the kernel the true structure of the user-space locking code.

As a solution, provide a workaround to introduce a more aggressive
sched_yield implementation:

  # default one:
  echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

  # always queues the current task next to the next task:
  echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

  # NOP:
  echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

in the future, the use of this sysctl might generate a deprecation
warning, so that apps start moving away from their reliance on
sched_yield().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched_fair.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |   19 +++++++++++++
 3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
  */
 unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
 /*
  * SCHED_BATCH wake-up granularity.
  * (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	if (!sysctl_sched_yield_bug_workaround) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		__update_rq_clock(rq);
+
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0);
+		enqueue_entity(cfs_rq, &p->se, 0);
+		return;
+	}
+
+	if (sysctl_sched_yield_bug_workaround == 1) {
+		struct cfs_rq *cfs_rq = task_cfs_rq(p);
+		struct rb_node *curr, *next, *first;
+		struct task_struct *p_next;
+		s64 yield_key;
+
+		__update_rq_clock(rq);
+		curr = &p->se.run_node;
+		first = first_fair(cfs_rq);
+		/*
+		 * Move this task to the second place in the tree:
+		 */
+		if (curr != first)
+			next = rb_next(curr);
+		else
+			next = first;
+		/*
+		 * We were the last one already - nothing to do, return
+		 * and reschedule:
+		 */
+		if (unlikely(!next))
+			return;
+
+		p_next = rb_entry(next, struct task_struct, se.run_node);
+		/*
+		 * Minimally necessary key value to be the second in the tree:
+		 */
+		yield_key = p_next->se.fair_key +
+				(int)sysctl_sched_yield_granularity;
+
+		dequeue_entity(cfs_rq, &p->se, 0);
+
+		/*
+		 * Only update the key if we need to move more backwards
+		 * than the minimally necessary position to be the second:
+		 */
+		if (p->se.fair_key < yield_key)
+			p->se.fair_key = yield_key;
+
+		__enqueue_entity(cfs_rq, &p->se);
+		return;
+	}

-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Just reschedule, do nothing else:
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	resched_task(p);
 }

 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_granularity_ns",
+		.data		= &sysctl_sched_yield_granularity,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &min_sched_granularity_ns,
+		.extra2		= &max_sched_granularity_ns,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_wakeup_granularity_ns",
 		.data		= &sysctl_sched_wakeup_granularity,
 		.maxlen		= sizeof(unsigned int),
@@ -303,6 +314,14 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 19:18 ` Ingo Molnar
@ 2007-09-19 19:39 ` Linus Torvalds
  2007-09-19 19:56 ` Ingo Molnar
  2007-09-19 20:00 ` CFS: some bad numbers with Java/database threading [FIXED] Chris Friesen
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2007-09-19 19:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

On Wed, 19 Sep 2007, Ingo Molnar wrote:
>
> Linus, what do you think? I have no strong feelings, I think the patch
> cannot hurt (it does not change anything by default) - but we should not
> turn the workaround flag on by default.

I disagree. I think CFS made "sched_yield()" worse, and what you call "bug workaround" is likely the *better* behaviour.

The fact is, sched_yield() is not - and should not be - about "recalculating the position in the scheduler queue" like you do now in CFS.

It very much is about moving the thread *dead last* within its priority group. That's what it does for round-robin, and it's not about fairness, it's about:

 - Opengroup:

	DESCRIPTION
	    The sched_yield() function forces the running thread to relinquish
	    the processor until it again becomes the head of its thread list.
	    It takes no arguments.

 - Linux man-page:

	DESCRIPTION
	    A process can relinquish the processor voluntarily without blocking
	    by calling sched_yield. The process will then be moved to the end
	    of the queue for its static priority and a new process gets to run.

and quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the "right place" in the rbtree. It should move it *last*.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 19:39 ` Linus Torvalds
@ 2007-09-19 19:56 ` Ingo Molnar
  2007-09-19 20:26 ` Ingo Molnar
  2007-09-19 20:28 ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-09-19 19:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Linus, what do you think? I have no strong feelings, I think the
> > patch cannot hurt (it does not change anything by default) - but we
> > should not turn the workaround flag on by default.
>
> I disagree. I think CFS made "sched_yield()" worse, and what you call
> "bug workaround" is likely the *better* behaviour.
>
> The fact is, sched_yield() is not - and should not be - about
> "recalculating the position in the scheduler queue" like you do now in
> CFS.
>
> It very much is about moving the thread *dead last* within its
> priority group. [...]
>
> and quite frankly, the current CFS behaviour simply looks buggy. It
> should simply not move it to the "right place" in the rbtree. It
> should move it *last*.

ok, we can do that.

the O(1) implementation of yield() was pretty arbitrary: it did not move the task last on the same priority level - it only did so within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task.

so the yield() implementation was so tightly tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS.

in CFS we don't have a per-nice-level rbtree, so we cannot move the task dead last within the same priority group - but we can move it dead last in the whole tree. (then it would be put even after nice +19 tasks.) People might complain about _that_.

another practical problem is that this will break certain desktop apps that call yield() [some firefox plugins do that, some 3D apps do that, etc.] but don't expect to be moved 'very late' into the queue - they expect the O(1) scheduler's behaviour of being delayed "a bit". (That's why i added the yield-granularity tunable.)

we can make yield super-aggressive, which is pretty much the only sane (because well-defined) thing to do (besides turning yield into a NOP), but there will be lots of regression reports about lost interactivity during load. sched_yield() is a mortally broken API. "fix the app" would be the answer, but still there will be lots of complaints.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 19:56 ` Ingo Molnar
@ 2007-09-19 20:26 ` Ingo Molnar
  2007-09-19 20:28 ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-09-19 20:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

* Ingo Molnar <mingo@elte.hu> wrote:

> ok, we can do that.
>
> the O(1) implementation of yield() was pretty arbitrary: it did not
> move the task last on the same priority level - it only did so within
> the active array. So expired tasks (such as CPU hogs) would come
> _after_ a yield()-ing task.
>
> so the yield() implementation was so tightly tied to the data
> structures of the O(1) scheduler that it was impossible to fully
> emulate it in CFS.
>
> in CFS we don't have a per-nice-level rbtree, so we cannot move the
> task dead last within the same priority group - but we can move it
> dead last in the whole tree. (then it would be put even after nice +19
> tasks.) People might complain about _that_.
>
> another practical problem is that this will break certain desktop apps
> that call yield() [some firefox plugins do that, some 3D apps do that,
> etc.] but don't expect to be moved 'very late' into the queue - they
> expect the O(1) scheduler's behaviour of being delayed "a bit".
> (That's why i added the yield-granularity tunable.)
>
> we can make yield super-aggressive, which is pretty much the only sane
> (because well-defined) thing to do (besides turning yield into a NOP),
> but there will be lots of regression reports about lost interactivity
> during load. sched_yield() is a mortally broken API. "fix the app"
> would be the answer, but still there will be lots of complaints.

find below the fix that puts yielding tasks to the rightmost position in the rbtree. I have not tested it extensively yet, but it appears to work on x86. (i tested yield using interactive tasks and they get hurt now under load - but this would be expected.)

	Ingo

---------------------->
Subject: sched: make yield more agressive
From: Ingo Molnar <mingo@elte.hu>

make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    5 +----
 kernel/sched_fair.c |   39 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 35 insertions(+), 9 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
 	struct rq *rq = this_rq_lock();

 	schedstat_inc(rq, yld_cnt);
-	if (unlikely(rq->nr_running == 1))
-		schedstat_inc(rq, yld_act_empty);
-	else
-		current->sched_class->yield_task(rq, current);
+	current->sched_class->yield_task(rq, current);

 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -902,14 +902,43 @@ static void dequeue_task_fair(struct rq
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+	struct sched_entity *rightmost, *se = &p->se;
+	struct rb_node *parent;

-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Are we the only task in the tree?
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	if (unlikely(cfs_rq->nr_running == 1))
+		return;
+	/*
+	 * Find the rightmost entry in the rbtree:
+	 */
+	do {
+		parent = *link;
+		link = &parent->rb_right;
+	} while (*link);
+
+	rightmost = rb_entry(parent, struct sched_entity, run_node);
+	/*
+	 * Already in the rightmost position?
+	 */
+	if (unlikely(rightmost == se))
+		return;
+
+	/*
+	 * Minimally necessary key value to be last in the tree:
+	 */
+	se->fair_key = rightmost->fair_key + 1;
+
+	if (cfs_rq->rb_leftmost == &se->run_node)
+		cfs_rq->rb_leftmost = rb_next(&se->run_node);
+	/*
+	 * Relink the task to the rightmost position:
+	 */
+	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_link_node(&se->run_node, parent, link);
+	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
 }

 /*

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 19:56 ` Ingo Molnar
  2007-09-19 20:26 ` Ingo Molnar
@ 2007-09-19 20:28 ` Linus Torvalds
  2007-09-19 21:41 ` Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2007-09-19 20:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

On Wed, 19 Sep 2007, Ingo Molnar wrote:
>
> in CFS we don't have a per-nice-level rbtree, so we cannot move it dead
> last within the same priority group - but we can move it dead last in
> the whole tree. (then they'd be put even after nice +19 tasks.) People
> might complain about _that_.

Yeah, with reasonably good reason.

The sched_yield() behaviour is actually very well-defined for RT tasks (now, whether it's a good interface to use or not is still open to debate, but at least it's a _defined_ interface ;), and I think we should at least try to approximate that behaviour for normal tasks, even if they aren't RT.

And the closest approximation is basically to say that "sched_yield()" on a non-RT process still means that all other runnable tasks in that same nice-level will be scheduled before the task is scheduled again. I think that would be the optimal approximation of the RT behaviour.

Now, quite understandably we might not be able to actually get that *exact* behaviour (since we mix up different nice levels), but I think from a QoI standpoint we should really have that as a "target" behaviour. So I think it should be moved back in the RB-tree, but really preferably only back behind all other processes at the same nice-level.

So I think we have two problems with yield():

 - it really doesn't have very nice/good semantics in the first place for
   anything but strict round-robin RT tasks. We can't do much about this
   problem, apart from trying to make people use proper locking and other
   models that do *not* depend on yield().

 - Linux has made the problem a bit worse by then having fairly arbitrary
   and different semantics over time. This is where I think it would be a
   good idea to have the above kind of "this is the ideal behaviour"
   target - we don't guarantee it being precise, but at least we'd have
   some definition of what people should be able to expect as the "best"
   behaviour.

So I don't think a "globally last" option is at all the best thing, but I think it's likely better than what CFS does now. And if we then have some agreement on what would be considered a further _improvement_, then that would be a good thing - maybe we're not perfect, but at least having a view of what we want to do is a good idea.

Btw, the "enqueue at the end" could easily be a statistical thing, not an absolute thing. So it's possible that we could perhaps implement the CFS "yield()" using the same logic as we have now, except *not* calling the "update_stats()" stuff:

	__dequeue_entity(..);
	__enqueue_entity(..);

and then just force the "fair_key" of the task to something that *statistically* means that it should be at the end of its nice queue.

I dunno.

		Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 20:28 ` Linus Torvalds
@ 2007-09-19 21:41 ` Ingo Molnar
  2007-09-19 21:49 ` Ingo Molnar
  ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-09-19 21:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Btw, the "enqueue at the end" could easily be a statistical thing, not
> an absolute thing. So it's possible that we could perhaps implement
> the CFS "yield()" using the same logic as we have now, except *not*
> calling the "update_stats()" stuff:
>
>	__dequeue_entity(..);
>	__enqueue_entity(..);
>
> and then just force the "fair_key" of the task to something that
> *statistically* means that it should be at the end of its nice queue.
>
> I dunno.

i thought a bit about the statistical approach, and it's good in principle, but it has an implementational problem/complication: if there are only yielding tasks in the system, then the "queue rightwards in the tree, statistically" approach cycles through the key-space artificially fast. That can cause various problems. (this means that the workload-flag patch that uses yield_granularity is buggy as well. The queue-rightmost patch did not have this problem.)

So right now there are only two viable options i think: either do the current weak thing, or do the rightmost thing. The statistical method might work too, but it needs more thought and more testing - i'm not sure we can get that ready for 2.6.23.

So what we have as working code right now is the two extremes, and apps will really mostly prefer either the first (if they don't truly want to use yield but somehow it got into their code) or the second (if they want some massive delay). So while it does not have a good QoI, how about a compat_yield sysctl that allows turning on the "queue rightmost" logic?
Find tested patch below.

Peter, what do you think?

Linus, if this would be acceptable for .23 then i'll push it out into sched.git together with another fix that Hiroshi Shimamoto just posted to lkml.

	Ingo

-------------------------------------->
Subject: sched: add /proc/sys/kernel/sched_compat_yield
From: Ingo Molnar <mingo@elte.hu>

add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree.

with sched_compat_yield=0:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2539 mingo     20   0  1576  252  204 R   50  0.0   0:02.03 loop_yield
2541 mingo     20   0  1576  244  196 R   50  0.0   0:02.05 loop

with sched_compat_yield=1:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2584 mingo     20   0  1576  248  196 R   99  0.0   0:52.45 loop
2582 mingo     20   0  1576  256  204 R    0  0.0   0:00.00 loop_yield

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    1
 kernel/sched.c        |    5 ---
 kernel/sched_fair.c   |   63 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |    8 ++++++
 4 files changed, 67 insertions(+), 10 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_
 extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_stat_granularity;
 extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_compat_yield;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
 	struct rq *rq = this_rq_lock();

 	schedstat_inc(rq, yld_cnt);
-	if (unlikely(rq->nr_running == 1))
-		schedstat_inc(rq, yld_act_empty);
-	else
-		current->sched_class->yield_task(rq, current);
+	current->sched_class->yield_task(rq, current);

 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read
 unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

 /*
+ * sys_sched_yield() compat mode
+ *
+ * This option switches the agressive yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_compat_yield;
+
+/*
  * SCHED_BATCH wake-up granularity.
  * (default: 25 msec, units: nanoseconds)
  *
@@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq
 }

 /*
- * sched_yield() support is very simple - we dequeue and enqueue
+ * sched_yield() support is very simple - we dequeue and enqueue.
+ *
+ * If compat_yield is turned on then we requeue to the end of the tree.
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+	struct sched_entity *rightmost, *se = &p->se;
+	struct rb_node *parent;

-	__update_rq_clock(rq);
 	/*
-	 * Dequeue and enqueue the task to update its
-	 * position within the tree:
+	 * Are we the only task in the tree?
+	 */
+	if (unlikely(cfs_rq->nr_running == 1))
+		return;
+
+	if (likely(!sysctl_sched_compat_yield)) {
+		__update_rq_clock(rq);
+		/*
+		 * Dequeue and enqueue the task to update its
+		 * position within the tree:
+		 */
+		dequeue_entity(cfs_rq, &p->se, 0);
+		enqueue_entity(cfs_rq, &p->se, 0);
+
+		return;
+	}
+	/*
+	 * Find the rightmost entry in the rbtree:
 	 */
-	dequeue_entity(cfs_rq, &p->se, 0);
-	enqueue_entity(cfs_rq, &p->se, 0);
+	do {
+		parent = *link;
+		link = &parent->rb_right;
+	} while (*link);
+
+	rightmost = rb_entry(parent, struct sched_entity, run_node);
+	/*
+	 * Already in the rightmost position?
+	 */
+	if (unlikely(rightmost == se))
+		return;
+
+	/*
+	 * Minimally necessary key value to be last in the tree:
+	 */
+	se->fair_key = rightmost->fair_key + 1;
+
+	if (cfs_rq->rb_leftmost == &se->run_node)
+		cfs_rq->rb_leftmost = rb_next(&se->run_node);
+	/*
+	 * Relink the task to the rightmost position:
+	 */
+	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_link_node(&se->run_node, parent, link);
+	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
 }

 /*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -303,6 +303,14 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_compat_yield",
+		.data		= &sysctl_sched_compat_yield,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.ctl_name	= CTL_UNNUMBERED,

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 21:41 ` Ingo Molnar
@ 2007-09-19 21:49 ` Ingo Molnar
  2007-09-19 21:58 ` Peter Zijlstra
  2007-09-26  1:46 ` CFS: new java yield graphs Antoine Martin
  2 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-09-19 21:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

* Ingo Molnar <mingo@elte.hu> wrote:

> Peter, what do you think?
>
> Linus, if this would be acceptable for .23 then i'll push it out into
> sched.git together with another fix that Hiroshi Shimamoto just posted
> to lkml.

it's getting late here so i've pushed the current version of those two patches out to:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

I'll redo this git tree if we want some other solution for yield. (but i think this is the safest approach for 2.6.23 - some apps will complain about too strong a yield, some apps will complain about too weak a yield. So by providing the two extremes we at least cover the practical range of behavior.)

there's nothing else pending for 2.6.23 otherwise at the moment, scheduler-wise.

	Ingo

------------------>
Hiroshi Shimamoto (1):
      sched: fix invalid sched_class use

Ingo Molnar (1):
      sched: add /proc/sys/kernel/sched_compat_yield

 include/linux/sched.h |    1
 kernel/sched.c        |   10 ++++---
 kernel/sched_fair.c   |   63 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sysctl.c       |    8 ++++++
 4 files changed, 72 insertions(+), 10 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED]
  2007-09-19 21:41 ` Ingo Molnar
  2007-09-19 21:49 ` Ingo Molnar
@ 2007-09-19 21:58 ` Peter Zijlstra
  2007-09-26  1:46 ` CFS: new java yield graphs Antoine Martin
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-09-19 21:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development

On Wed, 19 Sep 2007 23:41:05 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Btw, the "enqueue at the end" could easily be a statistical thing, not
> > an absolute thing. So it's possible that we could perhaps implement
> > the CFS "yield()" using the same logic as we have now, except *not*
> > calling the "update_stats()" stuff:
> >
> >	__dequeue_entity(..);
> >	__enqueue_entity(..);
> >
> > and then just force the "fair_key" of the task to something that
> > *statistically* means that it should be at the end of its nice queue.
> >
> > I dunno.
>
> i thought a bit about the statistical approach, and it's good in
> principle, but it has an implementational problem/complication: if there
> are only yielding tasks in the system, then the "queue rightwards in the
> tree, statistically" approach cycles through the key-space artificially
> fast. That can cause various problems. (this means that the
> workload-flag patch that uses yield_granularity is buggy as well. The
> queue-rightmost patch did not have this problem.)
>
> So right now there are only two viable options i think: either do the
> current weak thing, or do the rightmost thing. The statistical method
> might work too, but it needs more thought and more testing - i'm not
> sure we can get that ready for 2.6.23.
>
> So what we have as working code right now is the two extremes, and apps
> will really mostly prefer either the first (if they don't truly want to
> use yield but somehow it got into their code) or the second (if they
> want some massive delay). So while it does not have a good QoI, how
> about a compat_yield sysctl that allows turning on the "queue rightmost"
> logic? Find tested patch below.
>
> Peter, what do you think?

I have to agree that for .23 we can't do much more than this. And tasks moving to the right without actually doing work and advancing fair_clock do scare me a little.

Also, while I agree with Linus' definition of sched_yield, I'm afraid it will cause 'regressions' for all the interactivity people out there. Somehow this yield thing has made it into all sorts of unfortunate places like video drivers, so a heavily penalizing yield will hurt them.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: CFS: new java yield graphs
  2007-09-19 21:41 ` Ingo Molnar
  2007-09-19 21:49 ` Ingo Molnar
  2007-09-19 21:58 ` Peter Zijlstra
@ 2007-09-26  1:46 ` Antoine Martin
  2007-09-27  8:35 ` Ingo Molnar
  2 siblings, 1 reply; 32+ messages in thread
From: Antoine Martin @ 2007-09-26 1:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Chuck Ebbert, Satyam Sharma, Linux Kernel Development, Peter Zijlstra

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

These are pure cpu scheduling tests, not doing any I/O this time. All these tests are still "pathological" in the sense that they are only meant to show differences between schedulers rather than try to simulate real usage scenarios.

all the graphs are here:
http://devloop.org.uk/lkml/

Legend:
* 2.6.23-rc6-yield2: the "yield2" patch is this one:
  http://lkml.org/lkml/2007/9/14/157
* 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is built
  without preempt, with HZ=100, and the scheduling granularity is doubled
  using sysctl.
* 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch
* 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?):
  http://lkml.org/lkml/2007/9/19/409

of interest I found:
* rc6-mm1 is not always fair (see the "ManyThreadsYieldOften" tests) - it is
  the only one to have almost half the threads already finished at the
  halfway point when yielding often. Also slower for "RandomSleep".
* increasing latency makes a noticeable difference (see "ShortPause"): it
  can be more fair, but it also makes it a lot more erratic (see the
  "Yield" tests)
* most changes are only noticeable with a large number of threads (look
  for 'ManyThreads' in the filename)

Summary of the code: each thread loops 25 times, incrementing a shared counter each time and calling "iteration()" which does:
* "Long Pause" tests: 1000 sqrt() calls followed by a 25 millisecond sleep.
* "Random Sleep" tests: sleep(n) after 1000 sqrt() calculations, where n is
  a random number between 0 and 99 milliseconds.
* "Short Pause" tests: 1 million sqrt() calls followed by a 5ms sleep.
* "Yield Often" tests: sqrt() then yield(), both 250 times per iteration.
* "Yield" tests: 1 million sqrt() calls followed by a single yield().

All the source code is here:
http://bugs.devloop.org.uk/devloop/browser/metastores/examples-test/src/com/metastores/examples/perftest/

Antoine

PS: now testing -rc8, also added a test that consumes memory in each thread. also now recording context switches and idle time.

Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> Btw, the "enqueue at the end" could easily be a statistical thing, not
>> an absolute thing. So it's possible that we could perhaps implement
>> the CFS "yield()" using the same logic as we have now, except *not*
>> calling the "update_stats()" stuff:
>>
>>	__dequeue_entity(..);
>>	__enqueue_entity(..);
>>
>> and then just force the "fair_key" of the task to something that
>> *statistically* means that it should be at the end of its nice queue.
>>
>> I dunno.
>
> i thought a bit about the statistical approach, and it's good in
> principle, but it has an implementational problem/complication: if there
> are only yielding tasks in the system, then the "queue rightwards in the
> tree, statistically" approach cycles through the key-space artificially
> fast. That can cause various problems. (this means that the
> workload-flag patch that uses yield_granularity is buggy as well. The
> queue-rightmost patch did not have this problem.)
>
> So right now there are only two viable options i think: either do the
> current weak thing, or do the rightmost thing. The statistical method
> might work too, but it needs more thought and more testing - i'm not
> sure we can get that ready for 2.6.23.
>
> So what we have as working code right now is the two extremes, and apps
> will really mostly prefer either the first (if they don't truly want to
> use yield but somehow it got into their code) or the second (if they
> want some massive delay). So while it does not have a good QoI, how
> about a compat_yield sysctl that allows turning on the "queue rightmost"
> logic? Find tested patch below.
>
> Peter, what do you think?
>
> Linus, if this would be acceptable for .23 then i'll push it out into
> sched.git together with another fix that Hiroshi Shimamoto just posted
> to lkml.
>
>	Ingo
>
> -------------------------------------->
> Subject: sched: add /proc/sys/kernel/sched_compat_yield
> From: Ingo Molnar <mingo@elte.hu>
>
> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
> more agressive, by moving the yielding task to the last position
> in the rbtree.
>
> with sched_compat_yield=0:
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 2539 mingo     20   0  1576  252  204 R   50  0.0   0:02.03 loop_yield
> 2541 mingo     20   0  1576  244  196 R   50  0.0   0:02.05 loop
>
> with sched_compat_yield=1:
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 2584 mingo     20   0  1576  248  196 R   99  0.0   0:52.45 loop
> 2582 mingo     20   0  1576  256  204 R    0  0.0   0:00.00 loop_yield
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/linux/sched.h |    1
>  kernel/sched.c        |    5 ---
>  kernel/sched_fair.c   |   63 +++++++++++++++++++++++++++++++++++++++++++++-----
>  kernel/sysctl.c       |    8 ++++++
>  4 files changed, 67 insertions(+), 10 deletions(-)
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_
>  extern unsigned int sysctl_sched_batch_wakeup_granularity;
>  extern unsigned int sysctl_sched_stat_granularity;
>  extern unsigned int sysctl_sched_runtime_limit;
> +extern unsigned int sysctl_sched_compat_yield;
>  extern unsigned int sysctl_sched_child_runs_first;
>  extern unsigned int sysctl_sched_features;
>
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
>  	struct rq *rq = this_rq_lock();
>
>  	schedstat_inc(rq, yld_cnt);
> -	if (unlikely(rq->nr_running == 1))
> -		schedstat_inc(rq, yld_act_empty);
> -	else
> -		current->sched_class->yield_task(rq, current);
> +	current->sched_class->yield_task(rq, current);
>
>  	/*
>  	 * Since we are going to call schedule() anyway, there's
> Index: linux/kernel/sched_fair.c
> ===================================================================
> --- linux.orig/kernel/sched_fair.c
> +++ linux/kernel/sched_fair.c
> @@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read
>  unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;
>
>  /*
> + * sys_sched_yield() compat mode
> + *
> + * This option switches the agressive yield implementation of the
> + * old scheduler back on.
> + */
> +unsigned int __read_mostly sysctl_sched_compat_yield;
> +
> +/*
>   * SCHED_BATCH wake-up granularity.
>   * (default: 25 msec, units: nanoseconds)
>   *
> @@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq
>  }
>
>  /*
> - * sched_yield() support is very simple - we dequeue and enqueue
> + * sched_yield() support is very simple - we dequeue and enqueue.
> + *
> + * If compat_yield is turned on then we requeue to the end of the tree.
>   */
>  static void yield_task_fair(struct rq *rq, struct task_struct *p)
>  {
>  	struct cfs_rq *cfs_rq = task_cfs_rq(p);
> +	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
> +	struct sched_entity *rightmost, *se = &p->se;
> +	struct rb_node *parent;
>
> -	__update_rq_clock(rq);
>  	/*
> -	 * Dequeue and enqueue the task to update its
> -	 * position within the tree:
> +	 * Are we the only task in the tree?
> +	 */
> +	if (unlikely(cfs_rq->nr_running == 1))
> +		return;
> +
> +	if (likely(!sysctl_sched_compat_yield)) {
> +		__update_rq_clock(rq);
> +		/*
> +		 * Dequeue and enqueue the task to update its
> +		 * position within the tree:
> +		 */
> +		dequeue_entity(cfs_rq, &p->se, 0);
> +		enqueue_entity(cfs_rq, &p->se, 0);
> +
> +		return;
> +	}
> +	/*
> +	 * Find the rightmost entry in the rbtree:
>  	 */
> -	dequeue_entity(cfs_rq, &p->se, 0);
> -	enqueue_entity(cfs_rq, &p->se, 0);
> +	do {
> +		parent = *link;
> +		link = &parent->rb_right;
> +	} while (*link);
> +
> +	rightmost = rb_entry(parent, struct sched_entity, run_node);
> +	/*
> +	 * Already in the rightmost position?
> +	 */
> +	if (unlikely(rightmost == se))
> +		return;
> +
> +	/*
> +	 * Minimally necessary key value to be last in the tree:
> +	 */
> +	se->fair_key = rightmost->fair_key + 1;
> +
> +	if (cfs_rq->rb_leftmost == &se->run_node)
> +		cfs_rq->rb_leftmost = rb_next(&se->run_node);
> +	/*
> +	 * Relink the task to the rightmost position:
> +	 */
> +	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
> +	rb_link_node(&se->run_node, parent, link);
> +	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
>  }
>
>  /*
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c
> +++ linux/kernel/sysctl.c
> @@ -303,6 +303,14 @@ static ctl_table kern_table[] = {
>  		.proc_handler	= &proc_dointvec,
>  	},
>  #endif
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "sched_compat_yield",
> +		.data		= &sysctl_sched_compat_yield,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
> #ifdef CONFIG_PROVE_LOCKING
>  	{
>  		.ctl_name	= CTL_UNNUMBERED,

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG+bn+GK2zHPGK1rsRCg9aAJ42JXUKKf5V5fkSy48sxDZwIjs95gCeLDQ/
QdKYGH3mW8Cubn5IcfpArYY=
=mZPq
-----END PGP SIGNATURE-----

^ permalink raw reply
[flat|nested] 32+ messages in thread
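The sysctl added by the patch above is toggled at runtime; a minimal sketch of how a tester would flip it (requires root and a kernel carrying this patch; the sysctl(8) spelling is an assumption of standard procps behaviour, not something shown in the thread):

```shell
# Check the current mode (0 = new weak CFS yield, the default):
cat /proc/sys/kernel/sched_compat_yield

# Switch the old aggressive yield behaviour back on:
echo 1 > /proc/sys/kernel/sched_compat_yield

# Equivalent, if sysctl(8) is preferred:
sysctl -w kernel.sched_compat_yield=1
```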
* Re: CFS: new java yield graphs 2007-09-26 1:46 ` CFS: new java yield graphs Antoine Martin @ 2007-09-27 8:35 ` Ingo Molnar 0 siblings, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2007-09-27 8:35 UTC (permalink / raw) To: Antoine Martin Cc: Linus Torvalds, Chuck Ebbert, Satyam Sharma, Linux Kernel Development, Peter Zijlstra * Antoine Martin <antoine@nagafix.co.uk> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA512 > > These are pure cpu scheduling tests, not doing any I/O this time. All > these tests are still "pathological" in the sense that they are only > meant to show differences between schedulers rather than try to > simulate real usage scenarios. thanks for testing this! > all the graphs are here: > http://devloop.org.uk/lkml/ wow - really nice graphs! > Legend: > * 2.6.23-rc6-yield2: "yield2" patch is this one: > http://lkml.org/lkml/2007/9/14/157 > * 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is > built without preempt, with HZ100 and the scheduling granularity is > doubled using sysctl. > * 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch > * 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?): > http://lkml.org/lkml/2007/9/19/409 wrt. yield4 did you set /proc/sys/kernel/sched_compat_yield to 1? (with sched_compat_yield at 0, which is the default, nothing changes) which one would be your favorite kernel? To me yield4 looks pretty good. > of interest I found: > * rc6-mm1 is not always fair (see "ManyThreadsYieldOften" tests) - the > only one to have almost half the threads already finished at the half > way point when yielding often. Also slower for the "RandomSleep". 
> * increasing latency makes a noticeable difference (see "ShortPause")
> it can be more fair, but it also makes it a lot more erratic (see
> "Yield" tests)
> * most changes are only noticeable with a large number of threads (look
> for 'ManyThreads' in the filename)

i'm wondering, how easy would it be for you to test the sched-devel.git tree? If you haven't used git before then first install the 'git' package, then do:

  git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6.git
  cd linux-2.6.git
  git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

> PS: now testing -rc8, also added a test that consumes memory in each
> thread. also now recording context switches and idle time.

ok.

	Ingo

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading [FIXED] 2007-09-19 19:18 ` Ingo Molnar 2007-09-19 19:39 ` Linus Torvalds @ 2007-09-19 20:00 ` Chris Friesen 1 sibling, 0 replies; 32+ messages in thread From: Chris Friesen @ 2007-09-19 20:00 UTC (permalink / raw) To: Ingo Molnar Cc: Chuck Ebbert, Antoine Martin, Satyam Sharma, Linux Kernel Development, Peter Zijlstra, Linus Torvalds Ingo Molnar wrote: > > The correct way to tell the kernel that the task is blocked is to use > futexes for example, or any kernel-based locking or wait object - there > are myriads of APIs for these. (The only well-defined behavior of yield > is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.) Certainly this is reasonable for applications for which the source is available and readily recompilable. However, there are legacy closed-source apps out there expecting sched_yield() to result in a reasonable amount of time passing before the task is scheduled again. Also, there are installed bases of people that may have older versions of code that may wish to upgrade to newer kernels without upgrading the rest of the system. It seems odd to force them to update userspace apps just because we don't like the undefined semantics. > To avoid the reoccuring problems of applications mistakenly relying on > sched_yield(), we now context-switch on yield very weakly for > SCHED_OTHER tasks... <snip> > My patch below adds a sysctl flag that triggers a context-switch when > yield is called... <snip> > I think the patch cannot hurt (it does not change anything by default) - but we should not > turn the workaround flag on by default. If you agree that we should do > this, then please pull this single patch from the sched.git tree: I've always understood one of the kernel's basic tenets to be that we don't break userspace without a good reason. If there are apps out there that expect sched_yield() to give up the cpu, it seems counter-intuitive to ignore that expectation. 
Personally, I'd be in favour of making the context-switch the default behaviour, but at the very least it should be possible to enable a "backwards-compatibility mode" for sched_yield(). Chris ^ permalink raw reply [flat|nested] 32+ messages in thread
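In Java terms, Ingo's advice maps to blocking on a real synchronizer (java.util.concurrent primitives end up parked in the kernel, futex-backed on Linux) rather than spinning on Thread.yield(). A hypothetical sketch of the two styles, not code from the thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class YieldVsBlock {
    static final AtomicBoolean ready = new AtomicBoolean(false);
    static final CountDownLatch latch = new CountDownLatch(1);

    // Busy-wait: burns CPU and leans on the undefined SCHED_OTHER
    // yield semantics that the thread above is arguing against.
    static void waitYielding() {
        while (!ready.get())
            Thread.yield();
    }

    // Blocking: the JVM parks the thread in the kernel, so the
    // scheduler knows it is genuinely waiting and can run others.
    static void waitBlocking() throws InterruptedException {
        latch.await();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(new Runnable() {
            public void run() {
                ready.set(true);
                latch.countDown();
            }
        });
        producer.start();
        waitBlocking();     // returns once the producer has signalled
        waitYielding();     // trivially returns here: ready is already set
        producer.join();
        System.out.println("done");
    }
}
```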
* Re: CFS: some bad numbers with Java/database threading 2007-09-14 10:06 ` Satyam Sharma 2007-09-14 15:25 ` CFS: some bad numbers with Java/database threading [FIXED] Antoine Martin @ 2007-09-14 16:01 ` Satyam Sharma 2007-09-14 16:08 ` Satyam Sharma 2007-09-17 12:17 ` Antoine Martin 1 sibling, 2 replies; 32+ messages in thread From: Satyam Sharma @ 2007-09-14 16:01 UTC (permalink / raw) To: Antoine Martin Cc: Ingo Molnar, Linux Kernel Development, Peter Zijlstra, Nick Piggin, David Schwartz [ Argh, just noticed this thread got broke and had been living a parallel life due to Subject: changes, dropped Cc:'s, and munged In-Reply-To:'s. Adding back all interested folk here. ] > Hi Antoine, Ingo, > > > On 9/14/07, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > > hm, could you try the patch below ontop of 2.6.23-rc6 and do: > > > > > > echo 1 > /proc/sys/kernel/sched_yield_bug_workaround > > > > > > does this improve the numbers? > > Hmm, I know diddly about Java, and I don't want to preempt Antoine's next > test, but I noticed that he uses Thread.sleep() in his testcode and not the > Thread.yield() so it would be interesting if Antoine can test with this patch > and report if something shows up ... Ok, it appears Antoine tested with the patch and got: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield.png http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield-withload.png which leads me to believe this probably wasn't a yield problem after all, though it would still be useful if someone with more knowledge of Java could give that code a look over ... Curiously, the -rc3 oddity is still plainly visible there -- how do we explain that? Ingo, does that oddity (and the commits that went in around -rc3 time) give some clue as to the behaviour / characteristics of Antoine's workloads? 
> > the patch i sent was against CFS-devel. Could you try the one below, > > which is against vanilla -rc6, does it improve the numbers? (it should > > have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the > > sysctl. > > > On 9/13/07, Antoine Martin <antoine@nagafix.co.uk> wrote: > > > > All the 2.6.23-rc kernels performed poorly (except -rc3!): > > This is an interesting data point, IMHO ... considering these tests are long, > I suspect you ran them only once each per kernel. So I wonder how reliable > that -rc3 testpoint is. If this oddity is reproducible, Ok, it's reproducible, making our job easier. Which also means, Antoine, please do try the following git-bisecting: > 1. between 23-rc1 and 23-rc3, and find out which commit led to the > improvement in performance, and, > 2. between 23-rc3 and 23-rc6, and find out which commit brought down > the numbers again. > > [ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html, > git-bisect is easy and amazingly helpful on certain occasions. ] I don't have access to any real/meaningful SMP systems, so I wonder how much sense it makes in practical terms for me to try and run these tests locally on my little boxen ... would be helpful if someone with 4/8 CPU systems could give Antoine's testsuite a whirl :-) > > Notes about the tests and setup: > > * environment is: > > Dual Opteron 252 with 3GB ram, scsi disk, etc.. > > Sun Java 1.6 > > MySQL 5.0.44 > > Junit + ant + my test code (devloop.org.uk) > > * java threads are created first and the data is prepared, then all the > > threads are started in a tight loop. Each thread runs multiple queries > > with a 10ms pause (to allow the other threads to get scheduled) > > Don't know much about CFS either, but does that constant "10 ms" sleep > somehow lead to evil synchronization issues between the test threads? > Does randomizing that time (say from 2-20 ms) lead to different numbers? 
Umm, you mention _changing_ this value earlier, but it still remains the same for every thread during every loop for a given test run -- what I'm suggesting is making that code do something like: Thread.sleep(random(x, y)); where x=2, y=20 and random(x, y) returns a random integer between x and y, so all threads sleep for different durations in every loop, but still average out to about ~10 ms over a period. Try to vary x and y (to vary the average) and post the resulting graphs too? CONFIG_HZ (actually, full .config) and dmesg might be useful for us as well. Also, like David mentioned, counting the _number_ of times the test threads managed to execute those SQL queries is probably a better benchmark than measuring the time it takes for threads to finish execution itself -- uniformity in that number across threads would bring out how "fair" CFS is compared to previous kernels, for one ... And finally I need a clarification from you: from the code that I read, it appears you have *different* threads for purely inserting/selecting/updating/deleting records from the table, right? So I think David was barking up the wrong tree in that other reply there, where he said the *same* thread needs to execute multiple queries on the same data, and therefore your test code is susceptible to cache hotness and thread execution order effects ... but I don't see any such pathologies. I could be wrong, of course ... Which brings us to another issue -- how well does the testsuite capture real world workloads? Wouldn't multiple threads in the real world that arbitrarily execute insert/update/select/delete queries on the same table also need to implement some form of locking? How would that affect the numbers? 
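A sketch of that randomized pause, written for the Java 1.6 era of the tests (the class and method names are hypothetical, not Antoine's actual harness):

```java
import java.util.Random;

public class RandomPause {
    private static final Random rng = new Random();

    /** Random integer in [x, y], so sleeps average about (x + y) / 2 ms. */
    static int random(int x, int y) {
        return x + rng.nextInt(y - x + 1);
    }

    static void pauseBetweenQueries() throws InterruptedException {
        // was: Thread.sleep(10) -- a fixed pause risks marching all
        // threads in lockstep with the scheduler tick
        Thread.sleep(random(2, 20));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++)
            pauseBetweenQueries();
        System.out.println("paused 5 times, average target ~11 ms");
    }
}
```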
> > * load average is divided by the number of cpus (2) > > * more general information (which also covers some irrelevant > > information about some other tests I have published) is here: > > http://devloop.org.uk/documentation/database-performance/Setup/ > > > Thanks, > > Satyam > ^ permalink raw reply [flat|nested] 32+ messages in thread
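The bisections suggested above might look like this in practice (a sketch; the good/bad labels are inverted in the first run because -rc3 is the kernel that *improved*, and git bisect hunts for the first "bad" commit):

```shell
# Run 1: find the commit between -rc1 and -rc3 that improved performance.
# Here "bad" means "unexpectedly fast":
cd linux-2.6.git
git bisect start
git bisect good v2.6.23-rc1    # slow baseline
git bisect bad  v2.6.23-rc3    # unexpectedly fast
# git checks out a midpoint: build, boot, run the benchmark, then mark it
# "good" (slow) or "bad" (fast) and repeat until git names the commit.
git bisect reset

# Run 2: find the regression between -rc3 and -rc6 (labels in their
# normal sense this time):
git bisect start
git bisect good v2.6.23-rc3
git bisect bad  v2.6.23-rc6
```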
* Re: CFS: some bad numbers with Java/database threading 2007-09-14 16:01 ` CFS: some bad numbers with Java/database threading Satyam Sharma @ 2007-09-14 16:08 ` Satyam Sharma 2007-09-17 12:17 ` Antoine Martin 1 sibling, 0 replies; 32+ messages in thread From: Satyam Sharma @ 2007-09-14 16:08 UTC (permalink / raw) To: Antoine Martin Cc: Ingo Molnar, Linux Kernel Development, Peter Zijlstra, Nick Piggin, David Schwartz On 9/14/07, Satyam Sharma <satyam.sharma@gmail.com> wrote: > [snip] Ick, threading breaks in Gmail with Subject: changes, so I missed the latest updates on this thread ... oh well, never mind. Satyam ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: CFS: some bad numbers with Java/database threading 2007-09-14 16:01 ` CFS: some bad numbers with Java/database threading Satyam Sharma 2007-09-14 16:08 ` Satyam Sharma @ 2007-09-17 12:17 ` Antoine Martin 1 sibling, 0 replies; 32+ messages in thread From: Antoine Martin @ 2007-09-17 12:17 UTC (permalink / raw) To: Satyam Sharma Cc: Ingo Molnar, Linux Kernel Development, Peter Zijlstra, Nick Piggin, David Schwartz -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Satyam Sharma wrote: > I don't have access to any real/meaningful SMP systems, so I wonder > how much sense it makes in practical terms for me to try and run these > tests locally on my little boxen ... would be helpful if someone with > 4/8 CPU systems could give Antoine's testsuite a whirl :-) It should still work... if you're patient. [snip] covered in separate mail > CONFIG_HZ (actually, full .config) and dmesg might be > useful for us as well. http://devloop.org.uk/documentation/database-performance/dmesg-dualopteron.gz CONFIG_HZ=1000 > Also, like David mentioned, counting the _number_ of times the test > threads managed to execute those SQL queries is probably a better > benchmark than measuring the time it takes for threads to finish > execution itself -- uniformity in that number across threads would > bring out how "fair" CFS is compared to previous kernels, for one ... Good idea, I've added code to capture more fairness oriented attributes: * time until the first thread finishes - this should actually be higher when the scheduler is fairer. * total number of loop iterations executed when the first thread finishes (higher is better) etc. > And finally I need a clarification from you: from the code that I > read, it appears you have *different* threads for purely > inserting/selecting/updating/deleting records from the table, right? Correct. 
> So I think David was barking up the wrong tree > in that other reply there, where he said the *same* thread needs to > execute multiple queries on the same data, and therefore your test > code is susceptible to cache hotness and thread execution order > effects ... but I don't see any such pathologies. I could be wrong, of > course ... I think he does have a valid point: maybe the locality of the java threads is important, but there are so many of them and so many layers below it (jdbc pool, backend database threads, filesystem, device) that the various caches get thrashed quite often anyway, no matter what the interleaving factor is. Which means that being fairer in this case also means switching threads more often and hitting the caches' capacity limits earlier. In which case the granularity and CONFIG_HZ should have a noticeable impact. Ingo did suggest tweaking /proc/sys/kernel/sched_yield_granularity_ns, which I did, but this particular test timed out when it ran over the weekend... will run it again now. (results in a couple of days) Maybe the granularity should auto-adjust to prevent such cases? (probably not) Granularity (and therefore fairness) becomes less important as the number of threads becomes very large, as fairness starts to adversely affect throughput? I have changed my mind and now think that this is not a regression, or at least not one we should worry too much about. As David said, this workload is "pathological". 
And CFS does show some noticeable improvements with the new measurements (now using a shorter thread yield of 5ms and a higher number of loop iterations per thread: 25): It does not adversely affect throughput as much (will test with more threads later): http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield.png The number of threads that exit before we reach half the total number of loop iterations is lower and more predictable (meaning a more even distribution between all the threads): http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield-ThreadsFinished.png And the time it takes to reach this half way point is also more consistent: http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield-TimeToHalfWay.png "2.6.23-rc6-yield2" is the yield patch meant for 2.6.23, "2.6.23-rc6-yield3" is the one that is not meant to be merged. Both include: echo 1 > /proc/sys/kernel/sched_yield_bug_workaround The high latency case (coming soon) includes HZ250, no preempt, and doubles the /proc/sys/kernel/sched_yield_granularity_ns > Which brings us to another issue -- how well does the testsuite > capture real world workloads? Not well... This particular test is not meant to represent any real-life scenario - it is very very hard not to have a bias (that is why there are so many variations of TPC: TPC-W, TPC-H, ...) It is just meant to try to measure the impact of changing individual components (anything from kernels, databases, threads, coding techniques, ...) one at a time, running a variety of scenarios and spotting any anomalies. More often than not, the anomalies will tell you about which combinations to avoid, and not about unexpected improvements. 
> Wouldn't multiple threads in the real world that arbitrarily > execute insert/update/select/delete queries on the same table also > need to implement some form of locking? Generally yes, but not always - I have a personal preference for lock-less algorithms (like variations on optimistic locking) because they're a lot easier to get right (and understand), but they're not always suitable. (mostly because of ordering/timing/QoS issues) And I digress. > How would that affect the numbers? Hmmm. Not entirely sure, if the locking is implemented using database transactions, the threads would spend a bit longer in wait state whilst the database does its stuff (even more so for the databases that do not use row-level locking), switching tasks less, so it might actually help improve throughput in some cases (as long as the workload is cpu-bound rather than I/O bound) ? We'll know more when I get the results for the high latency case. Antoine -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFG7nBCGK2zHPGK1rsRCrs3AJ9e5Ye1KVgydd06aD6akoLe5Z2RYQCfQey2 PpQYOdsvdo5bLE68ro/KFbE= =c21f -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 32+ messages in thread
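The fairness measurements Antoine describes (time until the first thread finishes, and the total iteration count across all threads at that moment) could be collected with a plain-Java sketch along these lines (hypothetical class and field names, not Antoine's actual harness; database work replaced by a stand-in loop body):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class FairnessProbe {
    static final int THREADS = 100, LOOPS = 25;
    static final AtomicInteger totalIterations = new AtomicInteger();
    static final AtomicLong firstFinishNs = new AtomicLong(0);
    static final AtomicInteger iterationsAtFirstFinish = new AtomicInteger(-1);

    public static void main(String[] args) throws InterruptedException {
        final long start = System.nanoTime();
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < LOOPS; j++) {
                        // stand-in for one query + pause in the real test
                        totalIterations.incrementAndGet();
                        Thread.yield();
                    }
                    // the first thread to get here records both metrics;
                    // a fairer scheduler pushes firstFinishNs *up*
                    if (firstFinishNs.compareAndSet(0, System.nanoTime() - start))
                        iterationsAtFirstFinish.set(totalIterations.get());
                }
            });
        }
        for (Thread t : ts) t.start();   // started in a tight loop, as in the tests
        for (Thread t : ts) t.join();
        System.out.println("first finish after " + firstFinishNs.get() / 1000
                + " us, " + iterationsAtFirstFinish.get() + "/"
                + (THREADS * LOOPS) + " iterations done at that point");
    }
}
```

Under a perfectly fair scheduler the first finisher would see close to THREADS * LOOPS total iterations already done; under an unfair one it would see barely more than its own LOOPS.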
end of thread, other threads:[~2007-09-27 8:35 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-09-12 23:10 CFS: some bad numbers with Java/database threading Antoine Martin 2007-09-13 7:18 ` David Schwartz 2007-09-12 23:33 ` Nick Piggin 2007-09-13 19:02 ` Antoine Martin 2007-09-13 21:47 ` David Schwartz 2007-09-13 11:24 ` CFS: " Ingo Molnar 2007-09-14 8:32 ` Ingo Molnar 2007-09-14 10:06 ` Satyam Sharma 2007-09-14 15:25 ` CFS: some bad numbers with Java/database threading [FIXED] Antoine Martin 2007-09-14 15:32 ` Ingo Molnar 2007-09-18 17:00 ` Chuck Ebbert 2007-09-18 22:46 ` Ingo Molnar 2007-09-18 23:02 ` Chuck Ebbert 2007-09-19 18:45 ` David Schwartz 2007-09-19 19:48 ` Chris Friesen 2007-09-19 22:56 ` David Schwartz 2007-09-19 23:05 ` David Schwartz 2007-09-19 23:52 ` David Schwartz 2007-09-19 19:18 ` Ingo Molnar 2007-09-19 19:39 ` Linus Torvalds 2007-09-19 19:56 ` Ingo Molnar 2007-09-19 20:26 ` Ingo Molnar 2007-09-19 20:28 ` Linus Torvalds 2007-09-19 21:41 ` Ingo Molnar 2007-09-19 21:49 ` Ingo Molnar 2007-09-19 21:58 ` Peter Zijlstra 2007-09-26 1:46 ` CFS: new java yield graphs Antoine Martin 2007-09-27 8:35 ` Ingo Molnar 2007-09-19 20:00 ` CFS: some bad numbers with Java/database threading [FIXED] Chris Friesen 2007-09-14 16:01 ` CFS: some bad numbers with Java/database threading Satyam Sharma 2007-09-14 16:08 ` Satyam Sharma 2007-09-17 12:17 ` Antoine Martin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox