* x264 benchmarks BFS vs CFS @ 2009-12-17 9:33 Kasper Sandberg 2009-12-17 10:42 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 9:33 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML Mailinglist well well :) nothing quite speaks out like graphs.. http://doom10.org/index.php?topic=78.0 regards, Kasper Sandberg ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 9:33 x264 benchmarks BFS vs CFS Kasper Sandberg @ 2009-12-17 10:42 ` Jason Garrett-Glaser 2009-12-17 10:53 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-17 10:42 UTC (permalink / raw) To: Kasper Sandberg; +Cc: Ingo Molnar, LKML Mailinglist On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > well well :) nothing quite speaks out like graphs.. > > http://doom10.org/index.php?topic=78.0 > > > > regards, > Kasper Sandberg Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied it--and given the strict thread-ordering expectations of x264, you basically can't expect it to do any better, though I'm curious what's responsible for the gap in "veryslow", even with SCHED_BATCH enabled. The most odd case is that of "ultrafast", in which CFS immediately ties BFS when we enable SCHED_BATCH. We're doing some further testing to see exactly what the conditions of this are--is it because ultrafast is just so much faster than all the other modes and so switches threads/loads faster? Is it because ultrafast has relatively equal workload among the threads, unlike the other loads? We'll probably know soon. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 10:42 ` Jason Garrett-Glaser @ 2009-12-17 10:53 ` Ingo Molnar 2009-12-17 11:00 ` Kasper Sandberg 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-17 10:53 UTC (permalink / raw) To: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra Cc: Kasper Sandberg, LKML Mailinglist * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > well well :) nothing quite speaks out like graphs.. > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > regards, > > Kasper Sandberg > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > it--and given the strict thread-ordering expectations of x264, you basically > can't expect it to do any better, though I'm curious what's responsible for > the gap in "veryslow", even with SCHED_BATCH enabled. > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > when we enable SCHED_BATCH. We're doing some further testing to see exactly > what the conditions of this are--is it because ultrafast is just so much > faster than all the other modes and so switches threads/loads faster? Is it > because ultrafast has relatively equal workload among the threads, unlike > the other loads? We'll probably know soon. Thanks for testing it! Btw., you might want to make use of 'perf sched record', 'perf sched map', 'perf sched trace' etc. to get an insight into how a particular workload schedules and why those decisions are done. (You'll need CONFIG_SCHED_DEBUG=y for best results.) Thanks, Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 10:53 ` Ingo Molnar @ 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar ` (3 more replies) 0 siblings, 4 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 11:00 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > well well :) nothing quite speaks out like graphs.. > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > regards, > > > Kasper Sandberg > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > it--and given the strict thread-ordering expectations of x264, you basically > > can't expect it to do any better, though I'm curious what's responsible for > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > when we enable SCHED_BATCH. We're doing some further testing to see exactly That's kind of beside the point. All these tunables and weirdness are _NEVER_ going to work for people. Now forgive me for being so blunt, but for a user, having to do echo x264 > /proc/cfs/gief_me_performance_on_app or echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app just isn't usable. BFS matches, even exceeds, CFS on all counts with ZERO user tuning, so while CFS may be able to nearly match up with a ton of application-specific tweaking, that just doesn't work for a normal user. Not to mention that BFS does this while not losing interactivity, something which CFS certainly cannot boast. 
<snip> > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg @ 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 2009-12-17 13:30 ` Mike Galbraith ` (2 subsequent siblings) 3 siblings, 2 replies; 34+ messages in thread From: Ingo Molnar @ 2009-12-17 12:08 UTC (permalink / raw) To: Kasper Sandberg Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Kasper Sandberg <lkml@metanurb.dk> wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > it--and given the strict thread-ordering expectations of x264, you basically > > > can't expect it to do any better, though I'm curious what's responsible for > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. v2.6.32 improved quite a bit on the x264 front so i dont think that's necessarily the case. But yes, i'll subscribe to the view that we cannot satisfy everything all the time. There's tradeoffs in every scheduler design. 
> now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO > user tuning, so while cfs may be able to nearly match up with a ton of > application specific stuff, that just doesnt work for a normal user. > > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. What kind of latencies are those? Arent they just compiz induced due to different weighting of workloads in BFS and in the upstream scheduler? Would you be willing to help us out pinning them down? To move the discussion to the numeric front please send the 'perf sched latency' output of an affected workload. Thanks, Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 12:08 ` Ingo Molnar @ 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 12:35 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 13:08 +0100, Ingo Molnar wrote: > * Kasper Sandberg <lkml@metanurb.dk> wrote: > > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > v2.6.32 improved quite a bit on the x264 front so i dont think that's > necessarily the case. again, pretty much application specific, and furthermore, ONLY with SCHED_BATCH is it near BFS. as you know, SCHED_BATCH isnt exactly what you wanna do for desktop or other interactivity-hungry tasks? bfs manages better performance than cfs with SCHED_BATCH, without SCHED_BATCH > > But yes, i'll subscribe to the view that we cannot satisfy everything all the > time. There's tradeoffs in every scheduler design. 
Yet getting CFS even close to BFS's average performance requires tunables, switching scheduler policies, etc. > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO > > user tuning, so while cfs may be able to nearly match up with a ton of > > application specific stuff, that just doesnt work for a normal user. ^^^^ This is also something you need to consider. > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > What kind of latencies are those? Arent they just compiz induced due to > different weighting of workloads in BFS and in the upstream scheduler? > Would you be willing to help us out pinning them down? There's not much I can do; I don't have time to switch kernels on my systems. All I can give you is this simple information: on my systems, ranging from embedded to dual Core 2 Quad and Core i7, BFS manages to give lower latencies (i.e. jack doesn't skip with very low-latency output, and everything is smoother, even measurably, on the desktop) and greater performance (as evidenced by lots of benchmarks, including those I posted), and that is without touching a single scheduler policy or tunable at all. I'm well aware that CFS can be tweaked via tunables/policies to achieve a single one of these goals at a time, and I'm also well aware you cannot ever handle every single corner case perfectly with one scheduler. However, and consider this very thoroughly: BFS manages, without any tunables, to handle the vast majority of cases with an excellence CFS cannot 100% match even with tunables and scheduler policies.. and that is with A LOT less code as well.. This ought to tell you that something can and should be done. 
> > To move the discussion to the numeric front please send the 'perf sched > latency' output of an affected workload. > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg @ 2009-12-17 15:47 ` Arjan van de Ven 1 sibling, 0 replies; 34+ messages in thread From: Arjan van de Ven @ 2009-12-17 15:47 UTC (permalink / raw) To: Ingo Molnar Cc: Kasper Sandberg, Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 17 Dec 2009 13:08:26 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > What kind of latencies are those? Arent they just compiz induced due > to different weighting of workloads in BFS and in the upstream > scheduler? Would you be willing to help us out pinning them down? > > To move the discussion to the numeric front please send the 'perf > sched latency' output of an affected workload. CFS in .32 and before has one known, and now fixed, latency issue. In .32, wake_up() (which is behind most inter-thread communication, and lots of other things) was trying to keep the waker and wakee on the same logical cpu at pretty much all cost. In .33-git, Mike fixed this so that, if there's a free logical cpu sibling, or, on a multicore cpu, another core which shares the cache, the new task is just scheduled on that free cpu rather than on the current, guaranteed-busy, cpu. This change helps latency a lot, and as a result, performance for various latency-sensitive workloads... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 34+ messages in thread
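[The wakeup-placement idea Arjan describes can be sketched roughly as follows. This is purely illustrative pseudocode, not the actual kernel code; every name in it is made up for illustration (the real logic lives in the fair-scheduler wakeup path, kernel/sched_fair.c, in .33-git):]

```
/* Illustrative pseudocode only -- all helper names are invented. */
on_wake_up(task wakee, cpu waker_cpu):
    /* .33-git behavior: prefer a free cpu that shares cache with the waker
     * (an SMT sibling, or another core on the same multicore package). */
    for each cpu c in cache_sharing_domain(waker_cpu):
        if cpu_is_idle(c):
            return c          /* wakee starts running immediately */
    /* .32-and-before behavior: stay with the waker's cpu at all cost,
     * even though it is guaranteed busy (the waker is running on it). */
    return waker_cpu
```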
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar @ 2009-12-17 13:30 ` Mike Galbraith 2009-12-18 10:54 ` Kasper Sandberg 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 1:18 ` Jason Garrett-Glaser 3 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-17 13:30 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > it--and given the strict thread-ordering expectations of x264, you basically > > > can't expect it to do any better, though I'm curious what's responsible for > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. Fact is, it is working for a great number of people, the vast majority of whom don't even know where the knobs are, much less what they do. > now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app Theatrics noted. 
> just isnt usable, bfs matches, even exceeds cfs on all accounts, with > ZERO user tuning, so while cfs may be able to nearly match up with a ton > of application specific stuff, that just doesnt work for a normal user. Seems you haven't done much benchmarking. BFS has strengths as well as weaknesses, all schedulers do. > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. Not true. I sent Con hard evidence of a severe problem area wrt interactivity, and hard numbers showing other places where BFS needs some work. But hey, if BFS blows your skirt up, use it and be happy. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 13:30 ` Mike Galbraith @ 2009-12-18 10:54 ` Kasper Sandberg 2009-12-18 11:41 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:54 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote: > On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Fact is, it is working for a great number of people, the vast majority > of whom don't even know where the knobs are, much less what they do. but not as great as it could be :) > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > Theatrics noted. 
> > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > of application specific stuff, that just doesnt work for a normal user. > > Seems you haven't done much benchmarking. BFS has strengths as well as > weaknesses, all schedulers do. yeah, BFS just has more strengths and fewer weaknesses than CFS :) > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > Not true. I sent Con hard evidence of a severe problem area wrt > interactivity, and hard numbers showing other places where BFS needs > some work. But hey, if BFS blows your skirt up, use it and be happy. Theatrics noted. As for your point, well.. as far as I have heard, all you've come up with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going to do, and thus irrelevant > > -Mike > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:54 ` Kasper Sandberg @ 2009-12-18 11:41 ` Mike Galbraith 0 siblings, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 11:41 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 11:54 +0100, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote: > > On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > > > > > regards, > > > > > > Kasper Sandberg > > > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > > > Thats kinda besides the point. > > > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > > > Fact is, it is working for a great number of people, the vast majority > > of whom don't even know where the knobs are, much less what they do. > but not as great as it could be :) > > > > > > now forgive me for being so blunt, but for a user, having to do > > > echo x264 > /proc/cfs/gief_me_performance_on_app > > > or > > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > Theatrics noted. 
> > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > > of application specific stuff, that just doesnt work for a normal user. > > > > Seems you haven't done much benchmarking. BFS has strengths as well as > > weaknesses, all schedulers do. > yeah, BFS just has more strengths and fewer weaknesses than CFS :) > > > > > not to mention that bfs does this whilst not loosing interactivity, > > > something which cfs certainly cannot boast. > > > > Not true. I sent Con hard evidence of a severe problem area wrt > > interactivity, and hard numbers showing other places where BFS needs > > some work. But hey, if BFS blows your skirt up, use it and be happy. > Theatrics noted. > > As for your point, well.. as far as i have heard, all you've come up > with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going > to do, and thus irellevant Goodbye troll. *PLONK* ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 13:30 ` Mike Galbraith @ 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 10:56 ` Kasper Sandberg 2009-12-18 1:18 ` Jason Garrett-Glaser 3 siblings, 1 reply; 34+ messages in thread From: Thomas Fjellstrom @ 2009-12-17 21:22 UTC (permalink / raw) To: linux-kernel On Thu December 17 2009, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically > > > tied it--and given the strict thread-ordering expectations of x264, > > > you basically can't expect it to do any better, though I'm curious > > > what's responsible for the gap in "veryslow", even with SCHED_BATCH > > > enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately > > > ties BFS when we enable SCHED_BATCH. We're doing some further > > > testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. > > now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > ZERO user tuning, so while cfs may be able to nearly match up with a ton > of application specific stuff, that just doesnt work for a normal user. > > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. 
> > <snip> Strange, I seem to recall that BFS needs you to run apps with some silly schedtool program to get media apps to not skip while doing other tasks. (I don't have to tweak CFS at all) > > Thanks, > > > > Ingo -- Thomas Fjellstrom tfjellstrom@shaw.ca ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 21:22 ` Thomas Fjellstrom @ 2009-12-18 10:56 ` Kasper Sandberg 0 siblings, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:56 UTC (permalink / raw) To: tfjellstrom; +Cc: linux-kernel On Thu, 2009-12-17 at 14:22 -0700, Thomas Fjellstrom wrote: > On Thu December 17 2009, Kasper Sandberg wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> > wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically > > > > tied it--and given the strict thread-ordering expectations of x264, > > > > you basically can't expect it to do any better, though I'm curious > > > > what's responsible for the gap in "veryslow", even with SCHED_BATCH > > > > enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately > > > > ties BFS when we enable SCHED_BATCH. We're doing some further > > > > testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > of application specific stuff, that just doesnt work for a normal user. > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. 
> > > > <snip> > > Strange, I seem to recall that BFS needs you to run apps with some silly > schedtool program to get media apps to not skip while doing other tasks. (I > don't have to tweak CFS at all) You recall incorrectly > > > > Thanks, > > > > > > Ingo > > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg ` (2 preceding siblings ...) 2009-12-17 21:22 ` Thomas Fjellstrom @ 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 10:56 ` Kasper Sandberg 3 siblings, 2 replies; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 1:18 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: >> >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: >> > > well well :) nothing quite speaks out like graphs.. >> > > >> > > http://doom10.org/index.php?topic=78.0 >> > > >> > > >> > > >> > > regards, >> > > Kasper Sandberg >> > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied >> > it--and given the strict thread-ordering expectations of x264, you basically >> > can't expect it to do any better, though I'm curious what's responsible for >> > the gap in "veryslow", even with SCHED_BATCH enabled. >> > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. Can't individual applications request SCHED_BATCH? Our plan was to have x264 simply detect whether it's necessary (once we figure out which encoding settings result in the large-gap situation) and automatically enable it for the current application. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 1:18 ` Jason Garrett-Glaser @ 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:56 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-18 5:23 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Kasper Sandberg, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > >> > >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > >> > > well well :) nothing quite speaks out like graphs.. > >> > > > >> > > http://doom10.org/index.php?topic=78.0 > >> > > > >> > > > >> > > > >> > > regards, > >> > > Kasper Sandberg > >> > > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > >> > it--and given the strict thread-ordering expectations of x264, you basically > >> > can't expect it to do any better, though I'm curious what's responsible for > >> > the gap in "veryslow", even with SCHED_BATCH enabled. > >> > > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Can't individually applications request SCHED_BATCH? Our plan was to have > x264 simply detect if it was necessary (once we figure out what encoding > settings result in the large gap situation) and automatically enable it for > the current application. Yeah, SCHED_BATCH can be requested at will by an app. It's an unprivileged operation. It gets passed down to child tasks. 
(You can just do it unconditionally - older kernels will ignore it and give you an error code from the sched_setscheduler() call.) Having said that, we generally try to make things perform well without apps having to switch themselves to SCHED_BATCH. Mike, do you think we can make x264 perform as well (or nearly as well) under SCHED_OTHER as under SCHED_BATCH? Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 5:23 ` Ingo Molnar @ 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 10:57 ` Kasper Sandberg 0 siblings, 2 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 7:30 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > Having said that, we generally try to make things perform well without apps > having to switch themselves to SCHED_BATCH. Mike, do you think we can make > x264 perform as well (or nearly as well) under SCHED_OTHER as under > SCHED_BATCH? It's not bad as is, except for ultrafast mode. START_DEBIT is the biggest problem there. I don't think SCHED_OTHER will ever match SCHED_BATCH for this load, though I must say I haven't full-spectrum tested. This load really wants RR scheduling, and wakeup preemption necessarily perturbs run order. I'll probably piddle with it some more, it's an interesting load. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 7:30 ` Mike Galbraith @ 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 10:57 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 10:11 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 3234 bytes --] On Thu, Dec 17, 2009 at 11:30 PM, Mike Galbraith <efault@gmx.de> wrote: > On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > >> Having said that, we generally try to make things perform well without apps >> having to switch themselves to SCHED_BATCH. Mike, do you think we can make >> x264 perform as well (or nearly as well) under SCHED_OTHER as under >> SCHED_BATCH? > > It's not bad as is, except for ultrafast mode. START_DEBIT is the > biggest problem there. I don't think SCHED_OTHER will ever match > SCHED_BATCH for this load, though I must say I haven't full-spectrum > tested. This load really wants RR scheduling, and wakeup preemption > necessarily perturbs run order. > > I'll probably piddle with it some more, it's an interesting load. > > -Mike > > Two more thoughts here: 1) We're considering moving to a thread pool soon; we already have a working patch for it and if anything it'll save a few clocks spent on nice()ing threads and other such things. Will this improve START_DEBIT at all? I've attached the beta patch if you want to try it. Note this also works with 2) as well, so it adds yet another dimension to what's mentioned below. 2) We recently implemented a new threading model which may be interesting to test as well. This threading model gives worse compression *and* performance, but has one benefit: it adds zero latency, whereas normal threading adds a full frame of latency per thread. This was paid for by a company interested in ultra-low-latency streaming applications, where 1 millisecond is a huge deal. 
I've been thinking this might be interesting to bench from a kernel perspective as well, as when you're spawning a half-dozen threads and need them all done within 6 milliseconds, you start getting down to serious scheduler issues. The new threading model is much less complex than the regular one and works as follows. The frame is split into X slices, and each slice encoded with one thread. Specifically, it works via the following process: 1. Preprocess input frame, perform lookahead analysis on input frame (all singlethreaded) 2. Split up a ton of threads to do the main encode, one per slice. 3. Join all the threads. 4. Do post-filtering on the output frame, return. Clearly this is an utter disaster, since it spawns N times as many threads as the old threading model *and* they last far shorter, *and* only part of the application is multithreaded. But there's not really a better way to do low-latency threading, and it's an interesting challenge to boot. IIRC, it's also the way ffmpeg's encoder threading works. It's widely considered an inferior model, but as mentioned before, in this particular use-case there's no choice. To enable this, use --sliced-threads. I'd recommend using a higher-resolution clip for this, as it performs atrociously bad on very low resolution videos for reasons you might be able to guess. If you need a higher-res clip, check the SD or HD ones here: http://media.xiph.org/video/derf/ . I'm personally curious as to what kind of scheduler issues this results in--I haven't done any BFS vs CFS tests with this option enabled yet. 
Jason [-- Attachment #2: thread_pool_slices.diff --] [-- Type: application/octet-stream, Size: 13295 bytes --] diff --git a/common/common.h b/common/common.h index 417ac9e..28d6c1d 100644 --- a/common/common.h +++ b/common/common.h @@ -337,12 +337,20 @@ struct x264_t /* encoder parameters */ x264_param_t param; - x264_t *thread[X264_THREAD_MAX+1]; - x264_pthread_t thread_handle; - int b_thread_active; - int i_thread_phase; /* which thread to use for the next frame */ - int i_threadslice_start; /* first row in this thread slice */ - int i_threadslice_end; /* row after the end of this thread slice */ + x264_t *thread[X264_THREAD_MAX+1]; /* contexts for each frame in progress + lookahead */ + x264_pthread_t *thread_handle; + x264_pthread_cond_t thread_queue_cv; + x264_pthread_mutex_t thread_queue_mutex; + x264_t **thread_queue; /* frames that have been prepared but not yet claimed by a worker thread */ + x264_pthread_cond_t thread_active_cv; + x264_pthread_mutex_t thread_active_mutex; + int thread_active; + int b_thread_active; + int i_thread_phase; /* which thread to use for the next frame */ + int thread_exit; + int thread_error; + int i_threadslice_start; /* first row in this thread slice */ + int i_threadslice_end; /* row after the end of this thread slice */ /* bitstream output */ struct diff --git a/encoder/encoder.c b/encoder/encoder.c index 0c0010f..bc0e75b 100644 --- a/encoder/encoder.c +++ b/encoder/encoder.c @@ -47,6 +47,53 @@ static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current, x264_nal_t **pp_nal, int *pi_nal, x264_picture_t *pic_out ); +/* threading */ + +static void *x264_slices_write_thread( x264_t *h ); + +#ifdef HAVE_PTHREAD +static void x264_int_cond_broadcast( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{ + x264_pthread_mutex_lock( mutex ); + *var = val; + x264_pthread_cond_broadcast( cv ); + x264_pthread_mutex_unlock( mutex ); +} + +static void x264_int_cond_wait( x264_pthread_cond_t *cv, 
x264_pthread_mutex_t *mutex, int *var, int val ) +{ + x264_pthread_mutex_lock( mutex ); + while( *var != val ) + x264_pthread_cond_wait( cv, mutex ); + x264_pthread_mutex_unlock( mutex ); +} + +#else +static void x264_int_cond_broadcast( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{} +static void x264_int_cond_wait( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{} +#endif + +static void x264_thread_pool_push( x264_t *h ) +{ + assert( h->thread_active == 0 ); + h->thread_active = 1; + assert( h->b_thread_active == 0 ); + h->b_thread_active = 1; + x264_pthread_mutex_lock( &h->thread[0]->thread_queue_mutex ); + x264_frame_push( (void*)h->thread_queue, (void*)h ); + x264_pthread_cond_broadcast( &h->thread[0]->thread_queue_cv ); + x264_pthread_mutex_unlock( &h->thread[0]->thread_queue_mutex ); +} + +static int x264_thread_pool_wait( x264_t *h ) +{ + x264_int_cond_wait( &h->thread_active_cv, &h->thread_active_mutex, &h->thread_active, 0 ); + h->b_thread_active = 0; + return h->thread_error; +} + /**************************************************************************** * ******************************* x264 libs ********************************** @@ -943,6 +990,16 @@ x264_t *x264_encoder_open( x264_param_t *param ) for( i = 1; i < h->param.i_threads + !!h->param.i_sync_lookahead; i++ ) CHECKED_MALLOC( h->thread[i], sizeof(x264_t) ); + if( h->param.i_threads > 1 ) + { + CHECKED_MALLOCZERO( h->thread_handle, (h->param.i_threads + 1) * sizeof(x264_pthread_t) ); + CHECKED_MALLOCZERO( h->thread_queue, (h->param.i_threads + 1) * sizeof(x264_t*) ); + if( x264_pthread_cond_init( &h->thread_queue_cv, NULL ) ) + goto fail; + if( x264_pthread_mutex_init( &h->thread_queue_mutex, NULL ) ) + goto fail; + } + if( x264_lookahead_init( h, i_slicetype_length ) ) goto fail; @@ -967,6 +1024,14 @@ x264_t *x264_encoder_open( x264_param_t *param ) CHECKED_MALLOC( h->thread[i]->out.nal, init_nal_count*sizeof(x264_nal_t) ); 
h->thread[i]->out.i_nals_allocated = init_nal_count; + if( h->param.i_threads > 1 ) + { + if( x264_pthread_cond_init( &h->thread[i]->thread_active_cv, NULL ) ) + goto fail; + if( x264_pthread_mutex_init( &h->thread[i]->thread_active_mutex, NULL ) ) + goto fail; + } + if( allocate_threadlocal_data && x264_macroblock_cache_init( h->thread[i] ) < 0 ) goto fail; } @@ -1009,6 +1074,13 @@ x264_t *x264_encoder_open( x264_param_t *param ) h->sps->i_profile_idc == PROFILE_HIGH ? "High" : "High 4:4:4 Predictive", h->sps->i_level_idc/10, h->sps->i_level_idc%10 ); + if( h->param.i_threads > 1 ) + { + for( i = 0; i < h->param.i_threads; i++ ) + if( x264_pthread_create( &h->thread_handle[i], NULL, (void*)x264_slices_write_thread, h ) ) + return NULL; + } + return h; fail: x264_free( h ); @@ -1723,7 +1795,7 @@ static int x264_slice_write( x264_t *h ) h->mb.b_reencode_mb = 0; #if VISUALIZE - if( h->param.b_visualize ) + if( h->i_threads == 1 && h->param.b_visualize ) x264_visualize_mb( h ); #endif @@ -1851,24 +1923,10 @@ static void x264_thread_sync_stat( x264_t *dst, x264_t *src ) memcpy( &dst->stat.i_frame_count, &src->stat.i_frame_count, sizeof(dst->stat) - sizeof(dst->stat.frame) ); } -static void *x264_slices_write( x264_t *h ) +static int x264_slices_write_internal( x264_t *h ) { int i_slice_num = 0; int last_thread_mb = h->sh.i_last_mb; - if( h->param.i_sync_lookahead ) - x264_lower_thread_priority( 10 ); - -#ifdef HAVE_MMX - /* Misalign mask has to be set separately for each thread. 
*/ - if( h->param.cpu&X264_CPU_SSE_MISALIGN ) - x264_cpu_mask_misalign_sse(); -#endif - -#if VISUALIZE - if( h->param.b_visualize ) - if( x264_visualize_init( h ) ) - return (void *)-1; -#endif /* init stats */ memset( &h->stat.frame, 0, sizeof(h->stat.frame) ); @@ -1887,10 +1945,30 @@ static void *x264_slices_write( x264_t *h ) } h->sh.i_last_mb = X264_MIN( h->sh.i_last_mb, last_thread_mb ); if( x264_stack_align( x264_slice_write, h ) ) - return (void *)-1; + return -1; h->sh.i_first_mb = h->sh.i_last_mb + 1; } + return 0; +} + +static int x264_slices_write( x264_t *h ) +{ +#ifdef HAVE_MMX + /* Misalign mask has to be set separately for each thread. */ + if( h->param.cpu&X264_CPU_SSE_MISALIGN ) + x264_cpu_mask_misalign_sse(); +#endif + +#if VISUALIZE + if( h->param.b_visualize ) + if( x264_visualize_init( h ) ) + return -1; +#endif + + if( x264_slices_write_internal( h ) ) + return -1; + #if VISUALIZE if( h->param.b_visualize ) { @@ -1899,13 +1977,47 @@ static void *x264_slices_write( x264_t *h ) } #endif + return 0; +} + +static void *x264_slices_write_thread( x264_t *h ) +{ + if( h->param.i_sync_lookahead ) + x264_lower_thread_priority( 10 ); + +#ifdef HAVE_MMX + /* Misalign mask has to be set separately for each thread. 
*/ + if( h->param.cpu&X264_CPU_SSE_MISALIGN ) + x264_cpu_mask_misalign_sse(); +#endif + + for(;;) + { + x264_t *t = NULL; + + // get one frame from the queue + x264_pthread_mutex_lock( &h->thread_queue_mutex ); + while( !h->thread_queue[0] && !h->thread_exit ) + x264_pthread_cond_wait( &h->thread_queue_cv, &h->thread_queue_mutex ); + if( h->thread_queue[0] ) + t = (void*)x264_frame_shift( (void*)h->thread_queue ); + x264_pthread_mutex_unlock( &h->thread_queue_mutex ); + if( h->thread_exit ) + return (void *)0; + if( !t ) + continue; + + t->thread_error = x264_slices_write_internal( t ); + + x264_int_cond_broadcast( &t->thread_active_cv, &t->thread_active_mutex, &t->thread_active, 0 ); + } + return (void *)0; } static int x264_threaded_slices_write( x264_t *h ) { int i, j; - void *ret = NULL; /* set first/last mb and sync contexts */ for( i = 0; i < h->param.i_threads; i++ ) { @@ -1928,14 +2040,10 @@ static int x264_threaded_slices_write( x264_t *h ) /* dispatch */ for( i = 0; i < h->param.i_threads; i++ ) - if( x264_pthread_create( &h->thread[i]->thread_handle, NULL, (void*)x264_slices_write, (void*)h->thread[i] ) ) - return -1; + x264_thread_pool_push( h->thread[i] ); for( i = 0; i < h->param.i_threads; i++ ) - { - x264_pthread_join( h->thread[i]->thread_handle, &ret ); - if( (intptr_t)ret ) - return (intptr_t)ret; - } + if( x264_thread_pool_wait( h->thread[i] ) ) + return -1; /* deblocking and hpel filtering */ for( i = 0; i <= h->sps->i_mb_height; i++ ) @@ -2238,18 +2346,14 @@ int x264_encoder_encode( x264_t *h, h->i_threadslice_start = 0; h->i_threadslice_end = h->sps->i_mb_height; if( !h->param.b_sliced_threads && h->param.i_threads > 1 ) - { - if( x264_pthread_create( &h->thread_handle, NULL, (void*)x264_slices_write, h ) ) - return -1; - h->b_thread_active = 1; - } + x264_thread_pool_push( h ); else if( h->param.b_sliced_threads ) { if( x264_threaded_slices_write( h ) ) return -1; } else - if( (intptr_t)x264_slices_write( h ) ) + if( x264_slices_write( h ) ) 
return -1; return x264_encoder_frame_end( thread_oldest, thread_current, pp_nal, pi_nal, pic_out ); @@ -2263,13 +2367,8 @@ static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current, char psz_message[80]; if( h->b_thread_active ) - { - void *ret = NULL; - x264_pthread_join( h->thread_handle, &ret ); - if( (intptr_t)ret ) - return (intptr_t)ret; - h->b_thread_active = 0; - } + if( x264_thread_pool_wait( h ) ) + return -1; if( !h->out.i_nal ) { pic_out->i_type = X264_TYPE_AUTO; @@ -2472,15 +2571,29 @@ void x264_encoder_close ( x264_t *h ) x264_lookahead_delete( h ); - for( i = 0; i < h->param.i_threads; i++ ) + if( h->param.i_threads > 1 ) { // don't strictly have to wait for the other threads, but it's simpler than canceling them - if( h->thread[i]->b_thread_active ) + x264_pthread_mutex_lock( &h->thread_queue_mutex ); + h->thread_exit = 1; + x264_pthread_cond_broadcast( &h->thread_queue_cv ); + x264_pthread_mutex_unlock( &h->thread_queue_mutex ); + for( i = 0; i < h->param.i_threads; i++ ) + x264_pthread_join( h->thread_handle[i], NULL ); + for( i = 0; i < h->param.i_threads; i++ ) { - x264_pthread_join( h->thread[i]->thread_handle, NULL ); - assert( h->thread[i]->fenc->i_reference_count == 1 ); - x264_frame_delete( h->thread[i]->fenc ); + x264_pthread_cond_destroy( &h->thread[i]->thread_active_cv ); + x264_pthread_mutex_destroy( &h->thread[i]->thread_active_mutex ); + if( h->thread[i]->b_thread_active ) + { + assert( h->thread[i]->fenc->i_reference_count == 1 ); + x264_frame_delete( h->thread[i]->fenc ); + } } + x264_pthread_cond_destroy( &h->thread_queue_cv ); + x264_pthread_mutex_destroy( &h->thread_queue_mutex ); + x264_free( h->thread_handle ); + x264_free( h->thread_queue ); } if( h->param.i_threads > 1 && !h->param.b_sliced_threads ) diff --git a/encoder/lookahead.c b/encoder/lookahead.c index f33b167..039b9cb 100644 --- a/encoder/lookahead.c +++ b/encoder/lookahead.c @@ -152,7 +152,7 @@ int x264_lookahead_init( x264_t *h, int i_slicetype_length ) 
if( x264_macroblock_cache_init( look_h ) ) goto fail; - if( x264_pthread_create( &look_h->thread_handle, NULL, (void *)x264_lookahead_thread, look_h ) ) + if( x264_pthread_create( &h->thread_handle[h->param.i_threads], NULL, (void *)x264_lookahead_thread, look_h ) ) goto fail; look->b_thread_active = 1; @@ -170,7 +170,7 @@ void x264_lookahead_delete( x264_t *h ) h->lookahead->b_exit_thread = 1; x264_pthread_cond_broadcast( &h->lookahead->ifbuf.cv_fill ); x264_pthread_mutex_unlock( &h->lookahead->ifbuf.mutex ); - x264_pthread_join( h->thread[h->param.i_threads]->thread_handle, NULL ); + x264_pthread_join( h->thread_handle[h->param.i_threads], NULL ); x264_macroblock_cache_end( h->thread[h->param.i_threads] ); x264_free( h->thread[h->param.i_threads]->scratch_buffer ); x264_free( h->thread[h->param.i_threads] ); ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:11 ` Jason Garrett-Glaser @ 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:53 ` Mike Galbraith 0 siblings, 2 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 12:49 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 02:11 -0800, Jason Garrett-Glaser wrote: > Two more thoughts here: > > 1) We're considering moving to a thread pool soon; we already have a > working patch for it and if anything it'll save a few clocks spent on > nice()ing threads and other such things. Will this improve > START_DEBIT at all? Yeah, START_DEBIT only affects a thread once. > I've attached the beta patch if you want to try > it. Note this also works with 2) as well, so it adds yet another > dimension to what's mentioned below. > > 2) We recently implemented a new threading model which may be > interesting to test as well. This threading model gives worse > compression *and* performance, but has one benefit: it adds zero > latency, whereas normal threading adds a full frame of latency per > thread. This was paid for by a company interested in > ultra-low-latency streaming applications, where 1 millisecond is a > huge deal. I've been thinking this might be interesting to bench from > a kernel perspective as well, as when you're spawning a half-dozen > threads and need them all done within 6 milliseconds, you start > getting down to serious scheduler issues. > > The new threading model is much less complex than the regular one and > works as follows. The frame is split into X slices, and each slice > encoded with one thread. Specifically, it works via the following > process: > > 1. Preprocess input frame, perform lookahead analysis on input frame > (all singlethreaded) > 2. Split up a ton of threads to do the main encode, one per slice. > 3. Join all the threads. > 4. 
Do post-filtering on the output frame, return. > > Clearly this is an utter disaster, since it spawns N times as many > threads as the old threading model *and* they last far shorter, *and* > only part of the application is multithreaded. But there's not really > a better way to do low-latency threading, and it's an interesting > challenge to boot. IIRC, it's also the way ffmpeg's encoder threading > works. It's widely considered an inferior model, but as mentioned > before, in this particular use-case there's no choice. > > To enable this, use --sliced-threads. I'd recommend using a > higher-resolution clip for this, as it performs atrociously bad on > very low resolution videos for reasons you might be able to guess. If > you need a higher-res clip, check the SD or HD ones here: > http://media.xiph.org/video/derf/ . In another 8 hrs 24 min, I'll have a sunflower to stare at. > I'm personally curious as to what kind of scheduler issues this > results in--I haven't done any BFS vs CFS tests with this option > enabled yet. I'll look for x264 source, and patch/piddle. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 12:49 ` Mike Galbraith @ 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:36 ` Mike Galbraith 2009-12-18 13:53 ` Mike Galbraith 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-18 13:06 UTC (permalink / raw) To: Mike Galbraith Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Mike Galbraith <efault@gmx.de> wrote: > > I'm personally curious as to what kind of scheduler issues this results > > in--I haven't done any BFS vs CFS tests with this option enabled yet. > > I'll look for x264 source, and patch/piddle. btw., would be nice to look at it via tools/perf/ as well: perf stat --repeat 3 ... to see the basic hardware utilization (cycles/cache-misses, branch execution rate, instructions, etc.) and the basic parallelism metrics, at a glance. i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give us an even more detailed picture. Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 13:06 ` Ingo Molnar @ 2009-12-18 13:36 ` Mike Galbraith 0 siblings, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 13:36 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 14:06 +0100, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > > > I'm personally curious as to what kind of scheduler issues this results > > > in--I haven't done any BFS vs CFS tests with this option enabled yet. > > > > I'll look for x264 source, and patch/piddle. > > btw., would be nice to look at it via tools/perf/ as well: > > perf stat --repeat 3 ... > > to see the basic hardware utilization (cycles/cache-misses, branch execution > rate, instructions, etc.) and the basic parallelism metrics, at a glance. > > i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give > us an even more detailed picture. Almost virgin v2.6.32-10468-g020307d running 'medium'. 
encoded 600 frames, 36.52 fps, 13003.54 kb/s

 Performance counter stats for './x264.sh 8' (3 runs):

   63742.218844  task-clock-msecs      #      3.870 CPUs    ( +-   0.016% )
          42593  context-switches      #      0.001 M/sec   ( +-   0.487% )
           3011  CPU-migrations        #      0.000 M/sec   ( +-   0.417% )
          12862  page-faults           #      0.000 M/sec   ( +-   0.004% )
   151734450892  cycles                #   2380.439 M/sec   ( +-   1.947% )  (scaled from 71.44%)
   205642315207  instructions          #      1.355 IPC     ( +-   0.085% )  (scaled from 80.68%)
    16274905932  branches              #    255.324 M/sec   ( +-   0.080% )  (scaled from 80.67%)
     1257135617  branch-misses         #      7.724 %       ( +-   0.255% )  (scaled from 80.06%)
     3116653323  cache-references      #     48.895 M/sec   ( +-   0.340% )  (scaled from 23.78%)
       50823973  cache-misses          #      0.797 M/sec   ( +-   1.400% )  (scaled from 23.76%)

   16.470164901  seconds time elapsed   ( +-   0.079% )

encoded 600 frames, 36.58 fps, 13003.54 kb/s

 Performance counter stats for './x264.sh 8' (3 runs):

   133692266953  L1-icache-loads         ( +-   0.027% )
      997371592  L1-icache-load-misses   ( +-   0.009% )

   16.407060367  seconds time elapsed   ( +-   0.036% )

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar @ 2009-12-18 13:53 ` Mike Galbraith 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 13:53 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 13:49 +0100, Mike Galbraith wrote: > I'll look for x264 source, and patch/piddle. encoder/encoder.c: In function ‘x264_slice_write’: encoder/encoder.c:1813: error: ‘x264_t’ has no member named ‘i_threads’ make: *** [encoder/encoder.o] Error 1 marge:..src/x264 # git remote -v origin git://git.videolan.org/x264.git (fetch) origin git://git.videolan.org/x264.git (push) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser @ 2009-12-18 10:57 ` Kasper Sandberg 2009-12-18 11:05 ` Jason Garrett-Glaser 1 sibling, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:57 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > > Having said that, we generally try to make things perform well without apps > > having to switch themselves to SCHED_BATCH. Mike, do you think we can make > > x264 perform as well (or nearly as well) under SCHED_OTHER as under > > SCHED_BATCH? > > It's not bad as is, except for ultrafast mode. START_DEBIT is the > biggest problem there. I don't think SCHED_OTHER will ever match > SCHED_BATCH for this load, though I must say I haven't full-spectrum > tested. This load really wants RR scheduling, and wakeup preemption > necessarily perturbs run order. > > I'll probably piddle with it some more, it's an interesting load. Yes, i must say, very interresting, its very complicated and... oh wait, its just encoding a movie! > > -Mike > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:57 ` Kasper Sandberg @ 2009-12-18 11:05 ` Jason Garrett-Glaser 2009-12-19 1:08 ` Con Kolivas 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 11:05 UTC (permalink / raw) To: Kasper Sandberg Cc: Mike Galbraith, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: >> >> > Having said that, we generally try to make things perform well without apps >> > having to switch themselves to SCHED_BATCH. Mike, do you think we can make >> > x264 perform as well (or nearly as well) under SCHED_OTHER as under >> > SCHED_BATCH? >> >> It's not bad as is, except for ultrafast mode. START_DEBIT is the >> biggest problem there. I don't think SCHED_OTHER will ever match >> SCHED_BATCH for this load, though I must say I haven't full-spectrum >> tested. This load really wants RR scheduling, and wakeup preemption >> necessarily perturbs run order. >> >> I'll probably piddle with it some more, it's an interesting load. > Yes, i must say, very interresting, its very complicated and... oh wait, > its just encoding a movie! Your trolling is becoming a bit over-the-top at this point. You should also considering replying to multiple people in one email as opposed to spamming a whole bunch in sequence. Perhaps as the lead x264 developer I'm qualified to say that it certainly is a very complicated load due to the strict ordering requirements of the threading model--and that you should tone down the whining just a tad and perhaps read a bit more about how BFS and CFS work before complaining about them. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 11:05 ` Jason Garrett-Glaser @ 2009-12-19 1:08 ` Con Kolivas 2009-12-19 4:03 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Con Kolivas @ 2009-12-19 1:08 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Kasper Sandberg, Mike Galbraith, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > >> > Having said that, we generally try to make things perform well without > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > >> > under SCHED_BATCH? > >> > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > >> biggest problem there. I don't think SCHED_OTHER will ever match > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > >> tested. This load really wants RR scheduling, and wakeup preemption > >> necessarily perturbs run order. > >> > >> I'll probably piddle with it some more, it's an interesting load. > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > its just encoding a movie! > > Your trolling is becoming a bit over-the-top at this point. You > should also considering replying to multiple people in one email as > opposed to spamming a whole bunch in sequence. > > Perhaps as the lead x264 developer I'm qualified to say that it > certainly is a very complicated load due to the strict ordering > requirements of the threading model--and that you should tone down the > whining just a tad and perhaps read a bit more about how BFS and CFS > work before complaining about them. 
Your workload is interesting because it is a well written real world application with a solid threading model written in a cross platform portable way. Your code is valuable as a measure for precisely this reason, and there's a trap in trying to program in a way that "the scheduler might like". That's presumably what Kasper is trying to point out, albeit in a much blunter fashion. The only workloads I'm remotely interested in are real world workloads involving real applications like yours, software compilation, video playback, audio playback, gaming, apache page serving, mysql performance and so on that people in the real world use on real hardware all day every day. These are, of course, measurable even above and beyond the elusive and impossible to measure and quantify interactivity and responsiveness. I couldn't care less about some artificial benchmark involving LTP, timing mplayer playing in the presence of 100,000 pipes, volanomark which is just a sched_yield benchmark, dbench and hackbench which even their original programmers don't like them being used as a meaningful measure, and so on, and normal users should also not care about the values returned by these artificial benchmarks when they bear no resemblance to their real world performance cases as above. I have zero interest in adding any "tweaks" to BFS to perform well in X benchmark, for there be a path where dragons lie. I've always maintained that, and still stick to it, that the more tweaks you add for corner cases, the more corner cases you introduce yourself. BFS will remain for a targeted audience and I care not to appeal to any artificial benchmarketing obsessed population that drives mainline, since I don't -have- to. Mainline can do what it wants, and hopefully uses BFS as a yardstick for comparison when appropriate. Regards, -- -ck ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 1:08 ` Con Kolivas @ 2009-12-19 4:03 ` Mike Galbraith 2009-12-19 17:36 ` Kasper Sandberg 0 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-19 4:03 UTC (permalink / raw) To: Con Kolivas Cc: Jason Garrett-Glaser, Kasper Sandberg, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > >> > Having said that, we generally try to make things perform well without > > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > > >> > under SCHED_BATCH? > > >> > > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > > >> biggest problem there. I don't think SCHED_OTHER will ever match > > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > > >> tested. This load really wants RR scheduling, and wakeup preemption > > >> necessarily perturbs run order. > > >> > > >> I'll probably piddle with it some more, it's an interesting load. > > > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > > its just encoding a movie! > > > > Your trolling is becoming a bit over-the-top at this point. You > > should also considering replying to multiple people in one email as > > opposed to spamming a whole bunch in sequence. 
> > > > Perhaps as the lead x264 developer I'm qualified to say that it > > certainly is a very complicated load due to the strict ordering > > requirements of the threading model--and that you should tone down the > > whining just a tad and perhaps read a bit more about how BFS and CFS > > work before complaining about them. > > Your workload is interesting because it is a well written real world > application with a solid threading model written in a cross platform portable > way. Your code is valuable as a measure for precisely this reason, and > there's a trap in trying to program in a way that "the scheduler might like". > That's presumably what Kasper is trying to point out, albeit in a much blunter > fashion. If using a different kernel facility gives better results, go for what works best. Programmers have been doing that since day one. I doubt you'd call it a trap to trade a pipe for a socketpair if one produced better results than the other. Mind you, we should be able to better service the load with plain SCHED_OTHER, no argument there. > The only workloads I'm remotely interested in are real world workloads > involving real applications like yours, software compilation, video playback, > audio playback, gaming, apache page serving, mysql performance and so on that > people in the real world use on real hardware all day every day. These are, of > course, measurable even above and beyond the elusive and impossible to measure > and quantify interactivity and responsiveness. > > I couldn't care less about some artificial benchmark involving LTP, timing > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > sched_yield benchmark, dbench and hackbench which even their original > programmers don't like them being used as a meaningful measure, and so on, and > normal users should also not care about the values returned by these artificial > benchmarks when they bear no resemblance to their real world performance cases > as above. 
I find all programs interesting and valid in their own right, whether they be a benchmark or not, though I agree that vmark and hackbench are a bit over the top. > I have zero interest in adding any "tweaks" to BFS to perform well in X > benchmark, for there be a path where dragons lie. I've always maintained that, > and still stick to it, that the more tweaks you add for corner cases, the more > corner cases you introduce yourself. BFS will remain for a targeted audience > and I care not to appeal to any artificial benchmarketing obsessed population > that drives mainline, since I don't -have- to. Mainline can do what it wants, > and hopefully uses BFS as a yardstick for comparison when appropriate. Interesting rant. IMO, benchmarks are all merely programs that do some work and quantify. Whether you like what they measure or not, whether they emit flattering numbers or not, they can all tell you something if you're willing to listen. Oh, and for the record, the timing-mplayer test was NOT run in the presence of 100,000 pipes; it was run in the presence of one cpu hog, as was the amarok load-time test. Those were UP tests showing you a weakness. All of the results I sent you were intended to show you areas that could use some improvement, but you don't want to hear, so you label and hand-wave. Below is a quote of the results I sent you. <quote> I've taken BFS out for a few spins while looking into BFS vs CFS latency reports, and noticed a couple of problems I'll share; comparison testing has been healthy for CFS, so maybe BFS can profit as well. Below are some bfs304 vs my working tree numbers from a run this morning, looking to see if some issues seen in earlier releases were still present. Comments on noted issues: It looks like there may be some affinity troubles, and there definitely seems to be a fairness bug still lurking. No idea what's up with that, but see data below, it's pretty nasty. Any sleepy load competing with a pure hog seems to be troublesome. 
The pgsql+oltp test data is very interesting to me, pgsql+oltp hates preemption with a passion, because of its USERLAND spinlocks. Preempt the lock holder, and watch the fun. Your preemption model suits it very well at the low end, and does pretty well all the way through. Really interesting to me is the difference in 1 and 2 client throughput, which is why I'm including these. mysql+oltp and tbench look like they're griping about affinity to me, but I haven't instrumented anything, so can't be sure. mysql+oltp I know is wakeup-preemption and affinity sensitive. Too little wakeup preemption, it suffers; any load balancing, it suffers. What vmark is so upset about, I have no idea. I know it's very affinity sensitive, and hates wakeup preemption passionately.

Numbers:

vmark
tip        108841 messages per second
tip++      116260 messages per second
31.bfs304   28279 messages per second

tbench 8
tip        938.421 MB/sec 8 procs
tip++      952.302 MB/sec 8 procs
31.bfs304  709.121 MB/sec 8 procs

mysql+oltp
clients           1         2         4         8        16        32        64       128       256
tip         9999.36  18493.54  34652.91  34253.13  32057.64  30297.43  28300.96  25450.14  20675.99
tip++      10041.16  18531.16  34934.22  34192.65  32829.65  32010.55  30341.31  27340.65  22724.87
31.bfs304   9459.85  14952.44  32209.07  29724.03  28608.02  27051.10  24851.44  21223.15  15809.46

pgsql+oltp
clients           1         2         4         8        16        32        64       128       256
tip        13577.63  26510.67  51871.05  51374.62  50190.69  45494.64  37173.83  27767.09  22795.23
tip++      13685.69  26693.42  52056.45  51733.30  50854.75  49790.95  48972.02  47517.34  44999.22
31.bfs304  15467.03  21126.57  52673.76  50972.41  49652.54  46015.73  44567.18  40419.90  33276.67

fairness bug in 31.bfs304?

prep:
set CPU governor to performance first, as in all benchmarking.
taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy)
taskset -p 0x1 `pidof Xorg`

perf stat taskset -c 0 konsole -e exit
31.bfs304  2.073724549 seconds time elapsed
tip++      0.989323860 seconds time elapsed

note: amarok pins itself to CPU0, and is set up to use mysql database.
prep: cache warmup run.
perf stat amarokapp (quit after 12000 song mp3 collection is loaded)
31.bfs304  136.418518486 seconds time elapsed
tip++       19.439268066 seconds time elapsed

prep: restart amarok, wait for load, start playing
perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie)
31.bfs304  432.712500554 seconds time elapsed
tip++      363.622519583 seconds time elapsed

^ permalink raw reply [flat|nested] 34+ messages in thread
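An aside for anyone reproducing the prep steps above: the `taskset -c 0` / `taskset -p 0x1` pinning can equally be done from inside a process via the same Linux `sched_setaffinity(2)` syscall that taskset uses. A minimal sketch using Python's wrappers (the choice of CPU 0 mirrors the tests above; the `pin_to_cpu` helper name is mine):

```python
import os

def pin_to_cpu(pid: int, cpu: int) -> set:
    """Restrict `pid` (0 = the calling process) to a single CPU,
    like `taskset -c`/`taskset -p`, and return the new affinity mask."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    # Linux-only: pin ourselves to CPU 0, as the konsole/mplayer runs do.
    if hasattr(os, "sched_setaffinity"):
        print("affinity now:", pin_to_cpu(0, 0))
```

This is how one forces the "UP test" condition on an SMP box: both the hog and the measured task end up competing for the same CPU.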
* Re: x264 benchmarks BFS vs CFS 2009-12-19 4:03 ` Mike Galbraith @ 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 0 siblings, 2 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-19 17:36 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote: > On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > > On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > > > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > > > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > > >> > Having said that, we generally try to make things perform well without > > > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > > > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > > > >> > under SCHED_BATCH? > > > >> > > > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > > > >> biggest problem there. I don't think SCHED_OTHER will ever match > > > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > > > >> tested. This load really wants RR scheduling, and wakeup preemption > > > >> necessarily perturbs run order. > > > >> > > > >> I'll probably piddle with it some more, it's an interesting load. > > > > > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > > > its just encoding a movie! > > > > > > Your trolling is becoming a bit over-the-top at this point. You > > > should also considering replying to multiple people in one email as > > > opposed to spamming a whole bunch in sequence. 
> > > > > > Perhaps as the lead x264 developer I'm qualified to say that it > > > certainly is a very complicated load due to the strict ordering > > > requirements of the threading model--and that you should tone down the > > > whining just a tad and perhaps read a bit more about how BFS and CFS > > > work before complaining about them. > > > > Your workload is interesting because it is a well written real world > > application with a solid threading model written in a cross platform portable > > way. Your code is valuable as a measure for precisely this reason, and > > there's a trap in trying to program in a way that "the scheduler might like". > > That's presumably what Kasper is trying to point out, albeit in a much blunter > > fashion. > > If using a different kernel facility gives better results, go for what > works best. Programmers have been doing that since day one. I doubt > you'd call it a trap to trade a pipe for a socketpair if one produced > better results than the other. Of course in this case that is what performs best on a single scheduler... > > Mind you, we should be able to better service the load with plain > SCHED_OTHER, no argument there. Great, so when you said "I don't think it will get better" (or words to that effect), that didn't mean anything? 
> > > > I couldn't care less about some artificial benchmark involving LTP, timing > > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > > sched_yield benchmark, dbench and hackbench which even their original > > programmers don't like them being used as a meaningful measure, and so on, and > > normal users should also not care about the values returned by these artificial > > benchmarks when they bear no resemblance to their real world performance cases > > as above. > > I find all programs interesting and valid in their own right, whether > they be a benchmark or not, though I agree that vmark and hackbench are > a bit over the top. Yes.. it's interesting to SEE; whether it's relevant and something to care about is entirely different. Yes, it's very interesting that something craps out, but this thing is _NEVER_ going to occur in real life, and if it happens to by some magical christmas fluke, then that is fortunately only the ONE time you're seeing that problem, and as such, it's irrelevant, and certainly doesn't merit workarounds which make other very common stuff perform significantly worse. 
I suspect Con is very interested in listening; however, as he has stated, if fixing some corner case in an artificial load requires damaging a real-world load, that is an unacceptable modification to him, and I agree. I ask you this: would you rather have some artificial benchmark run better, but your own everyday applications run slower as a result? It seems to me you would, which I cannot understand. > > Oh, and for the record, timing mplayer thing was NOT in the presence of > 100000 pipes, it was in the presence of one cpu hog, as was the time > amarok loading thing. Those were UP tests showing you a weakness. All > of the results I sent you were intended to show you areas that could use > some improvement, but you don't want to hear, so label and hand-wave. > > Below is a quote of the results I sent you. > > <quote> > > I've taken BFS out for a few spins while looking into BFS vs CFS latency > reports, and noticed a couple problems I'll share, comparison testing > has been healthy for CFS, so maybe BFS can profit as well. Below are > some bfs304 vs my working tree numbers from a run this morning, looking > to see if some issues seen in earlier releases were still present. > > Comments on noted issues: > > It looks like there may be some affinity troubles, and there definitely > seems to be a fairness bug still lurking. No idea what's up with that, > but see data below, it's pretty nasty. Any sleepy load competing with a > pure hog seems to be troublesome. > > The pgsql+oltp test data is very interesting to me, pgsql+oltp hates > preemption with a passion, because of it's USERLAND spinlocks. Preempt > the lock holder, and watch the fun. Your preemption model suits it very > well at the low end, and does pretty well all the way though. Really > interesting to me is the difference in 1 and 2 client throughput, why > I'm including these. > > msql+oltp and tbench look like they're griping about affinity to me, but > I haven't instrumented anything, so can't be sure. 
mysql+oltp I know is > a wakeup preemption and is very affinity sensitive. Too little wakeup > preemption, it suffers, any load balancing, it suffers. > > What vmark is so upset about, I have no idea. I know it's very affinity > sensitive, and hates wakeup preemption passionately. > > Numbers: > > vmark > tip 108841 messages per second > tip++ 116260 messages per second > 31.bfs304 28279 messages per second > > tbench 8 > tip 938.421 MB/sec 8 procs > tip++ 952.302 MB/sec 8 procs > 31.bfs304 709.121 MB/sec 8 procs > > mysql+oltp > clients 1 2 4 8 16 32 64 128 256 > tip 9999.36 18493.54 34652.91 34253.13 32057.64 30297.43 28300.96 25450.14 20675.99 > tip++ 10041.16 18531.16 34934.22 34192.65 32829.65 32010.55 30341.31 27340.65 22724.87 > 31.bfs304 9459.85 14952.44 32209.07 29724.03 28608.02 27051.10 24851.44 21223.15 15809.46 > > pgsql+oltp > clients 1 2 4 8 16 32 64 128 256 > tip 13577.63 26510.67 51871.05 51374.62 50190.69 45494.64 37173.83 27767.09 22795.23 > tip++ 13685.69 26693.42 52056.45 51733.30 50854.75 49790.95 48972.02 47517.34 44999.22 > 31.bfs304 15467.03 21126.57 52673.76 50972.41 49652.54 46015.73 44567.18 40419.90 33276.67 > > fairness bug in 31.bfs304? > > prep: > set CPU governor to performance first, as in all benchmarking. > taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy) > taskset -p 0x1 `pidof Xorg` > > perf stat taskset -c 0 konsole -e exit > 31.bfs304 2.073724549 seconds time elapsed > tip++ 0.989323860 seconds time elapsed > > note: amarok pins itself to CPU0, and is set up to use mysql database. > > prep: cache warmup run. 
> perf stat amarokapp (quit after 12000 song mp3 collection is loaded) > > 31.bfs304 136.418518486 seconds time elapsed > tip++ 19.439268066 seconds time elapsed > > prep: restart amarok, wait for load, start playing > > perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie) > 31.bfs304 432.712500554 seconds time elapsed > tip++ 363.622519583 seconds time elapsed > But presumably the cpu hog is running at the same priority, and if this is done on a UP system, that will obviously mean fairness will make stuff slower.. Try this on a dualcore or quadcore system, or of course just set the niceness accordingly... > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 17:36 ` Kasper Sandberg @ 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-19 20:57 UTC (permalink / raw) To: Kasper Sandberg Cc: Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 18:36 +0100, Kasper Sandberg wrote: > On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote: > > On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > > > Your workload is interesting because it is a well written real world > > > application with a solid threading model written in a cross platform portable > > > way. Your code is valuable as a measure for precisely this reason, and > > > there's a trap in trying to program in a way that "the scheduler might like". > > > That's presumably what Kasper is trying to point out, albeit in a much blunter > > > fashion. > > > > If using a different kernel facility gives better results, go for what > > works best. Programmers have been doing that since day one. I doubt > > you'd call it a trap to trade a pipe for a socketpair if one produced > > better results than the other. > > Ofcourse in this case that is what performs best one a single > scheduler... I have no idea what you're talking about here. > > Mind you, we should be able to better service the load with plain > > SCHED_OTHER, no argument there. > Great, so when you said "i dont think it will get better"(or words to > that effect), that didnt mean anything? Or here. Look. BFS handles this load well, a little better than CFS in fact. I don't have a problem with that, but you seem to think it's a big hairy deal for some strange reason. 
> > > The only workloads I'm remotely interested in are real world workloads > > > involving real applications like yours, software compilation, video playback, > > > audio playback, gaming, apache page serving, mysql performance and so on that > > > people in the real world use on real hardware all day every day. These are, of > > > course, measurable even above and beyond the elusive and impossible to measure > > > and quantify interactivity and responsiveness. > > > > > > I couldn't care less about some artificial benchmark involving LTP, timing > > > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > > > sched_yield benchmark, dbench and hackbench which even their original > > > programmers don't like them being used as a meaningful measure, and so on, and > > > normal users should also not care about the values returned by these artificial > > > benchmarks when they bear no resemblance to their real world performance cases > > > as above. > > > > I find all programs interesting and valid in their own right, whether > > they be a benchmark or not, though I agree that vmark and hackbench are > > a bit over the top. > > Yes.. its interresting to SEE, whether its relevant and something to > care about is entirely different. > > Yes, its very interresting that something craps out, now, this thing is > _NEVER_ going to occur in real life, and if it happens to do by some > magical christmas fluke, then that is fortunately only ONE time you're > seeing that problem, and as such, its irellevant, and certainly doesnt > merit workarounds which makes other very common stuff perform > significantly worse. Haven't you noticed yet that nobody but you and Con has suggested any course of action whatsoever? That it is you two who both mention then condemn workarounds and load specific tweaks all in the same breath with not one word having come from any other source? 
> > > I have zero interest in adding any "tweaks" to BFS to perform well in X > > > benchmark, for there be a path where dragons lie. I've always maintained that, > > > and still stick to it, that the more tweaks you add for corner cases, the more > > > corner cases you introduce yourself. BFS will remain for a targeted audience > > > and I care not to appeal to any artificial benchmarketing obsessed population > > > that drives mainline, since I don't -have- to. Mainline can do what it wants, > > > and hopefully uses BFS as a yardstick for comparison when appropriate. > > > > Interesting rant. IMO, benchmarks are all merely programs that do some > > work and quantify. Whether you like what they measure or not, whether > > they emit flattering numbers or not, they can all tell you something if > > you're willing to listen. > > I suspect con is very interrested in listening, however, as he have > stated, if fixing some corner case in an artificial load requires > damaging a realworld load, that is an unacceptable modification to him, > and I agree. I ask you this, would you rather some artificial benchmark > ran better, but your own everyday applications ran slower as a result? > It seems to me you do, which i can not understand. You can hand-wave all you want, I really do not care, but kindly keep your words out of my mouth. > > fairness bug in 31.bfs304? > > > > prep: > > set CPU governor to performance first, as in all benchmarking. > > taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy) > > taskset -p 0x1 `pidof Xorg` > > > > perf stat taskset -c 0 konsole -e exit > > 31.bfs304 2.073724549 seconds time elapsed > > tip++ 0.989323860 seconds time elapsed > > > > note: amarok pins itself to CPU0, and is set up to use mysql database. > > > > prep: cache warmup run. 
> > perf stat amarokapp (quit after 12000 song mp3 collection is loaded) > > > > 31.bfs304 136.418518486 seconds time elapsed > > tip++ 19.439268066 seconds time elapsed > > > > prep: restart amarok, wait for load, start playing > > > > perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie) > > 31.bfs304 432.712500554 seconds time elapsed > > tip++ 363.622519583 seconds time elapsed > > > > But presumably the cpu hog is running at same priority, and if this is > done on a UP system, that will obviously mean fairness will make stuff > slower.. > > Try this on a dualcore or quadcore system, or ofcourse just set the > niceness accordingly... Amazing that you can actually say that with a straight face. Look. You can hand-wave all results into irrelevance, I do not care. You've both made it perfectly clear that test results are not welcome. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith @ 2009-12-20 3:22 ` Andres Freund 2009-12-20 12:10 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Andres Freund @ 2009-12-20 3:22 UTC (permalink / raw) To: Kasper Sandberg Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > Try this on a dualcore or quadcore system, or of course just set the > niceness accordingly... Oh well. This is getting too much for a normally very silent and flame-fearing reader. Didn't *you* just tell others to shut up about using any tunables for any application? And that you don't need any tunables for BFS? Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 3:22 ` Andres Freund @ 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 0 siblings, 2 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-20 12:10 UTC (permalink / raw) To: Andres Freund Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote: > On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > > Try this on a dualcore or quadcore system, or of course just set the > > niceness accordingly... > Oh well. This is getting too much for a normally very silent and flame-fearing > reader. Didn't *you* just tell others to shut up about using any tunables for > any application? And that you don't need any tunables for BFS? That was an entirely different case; have you even been following the thread? OF COURSE you're going to see slowdowns on a UP system if you have a cpu hog and then run something else; this is the only behavior possible, and BFS handles it in a fair way. When I said we needed no tunables, that was for running a _SINGLE_ application, and then measuring said application's performance (where BFS indeed does beat CFS by quite a large margin). And as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called the "completely FAIR scheduler"? Or is that just the marketing name? > > Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 12:10 ` Kasper Sandberg @ 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-20 13:09 UTC (permalink / raw) To: Andres Freund Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote: > > On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > > > Try this on a dualcore or quadcore system, or of course just set the > > > niceness accordingly... > > Oh well. This is getting too much for a normally very silent and flame-fearing > > reader. Didn't *you* just tell others to shut up about using any tunables for > > any application? And that you don't need any tunables for BFS? Oh, and btw, the niceness is not really a "tunable". > > That was an entirely different case; have you even been following the > thread? > > OF COURSE you're going to see slowdowns on a UP system if you have a cpu > hog and then run something else; this is the only behavior possible, and > BFS handles it in a fair way. > > When I said we needed no tunables, that was for running a _SINGLE_ > application, and then measuring said application's performance (where > BFS indeed does beat CFS by quite a large margin). > > And as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called > the "completely FAIR scheduler"? Or is that just the marketing name? > > > > > > > Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
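For reference on the "niceness is not really a tunable" point: the nice level under discussion is a single per-process call, the same knob `nice -n 5 <cmd>` sets from the shell. A minimal sketch via Python's wrappers for `nice(2)`/`getpriority(2)` (the `lower_priority` helper name and the delta of 5 are illustrative):

```python
import os

def lower_priority(delta: int = 5) -> int:
    """De-prioritize the calling process, like launching it with
    `nice -n 5`.  An unprivileged process may only raise its nice
    value (run "nicer"), not lower it back.  Returns the resulting
    nice value (-20 highest priority .. 19 lowest)."""
    os.nice(delta)
    return os.getpriority(os.PRIO_PROCESS, 0)
```

Giving the cpu hog a higher nice value this way (or via `nice`/`renice`) is what "set the niceness accordingly" amounts to in the UP tests being argued about.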
* Re: x264 benchmarks BFS vs CFS 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg @ 2009-12-20 15:13 ` Mike Galbraith 2009-12-20 15:51 ` Mike Galbraith 1 sibling, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-20 15:13 UTC (permalink / raw) To: Kasper Sandberg Cc: Andres Freund, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > and as for CFS, it SHOULD exhibit fair behavior anyway, isnt it called > "completely FAIR scheduler" ? or is that just the marketing name? Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 15:13 ` Mike Galbraith @ 2009-12-20 15:51 ` Mike Galbraith 2009-12-22 7:33 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-20 15:51 UTC (permalink / raw) To: Kasper Sandberg Cc: Andres Freund, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 16:13 +0100, Mike Galbraith wrote: > On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > > > and as for CFS, it SHOULD exhibit fair behavior anyway, isnt it called > > "completely FAIR scheduler" ? or is that just the marketing name? > > Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment. All done? Do you think THAT may be why I thought Con might be interested?!? -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 15:51 ` Mike Galbraith @ 2009-12-22 7:33 ` Jason Garrett-Glaser 2009-12-22 7:39 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-22 7:33 UTC (permalink / raw) To: Mike Galbraith Cc: Kasper Sandberg, Andres Freund, Con Kolivas, Ingo Molnar, Peter Zijlstra, LKML Mailinglist Benchmarks for the new threading model are up, along with a few others: http://doom10.org/index.php?topic=78.0 Interestingly enough, CFS beats BFS on zerolatency by a significant margin. Unsurprisingly, given the threading model, the optimal number of threads is equal to the number of cores. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-22 7:33 ` Jason Garrett-Glaser @ 2009-12-22 7:39 ` Jason Garrett-Glaser 0 siblings, 0 replies; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-22 7:39 UTC (permalink / raw) To: Mike Galbraith Cc: Kasper Sandberg, Andres Freund, Con Kolivas, Ingo Molnar, Peter Zijlstra, LKML Mailinglist On Tue, Dec 22, 2009 at 2:33 AM, Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > Benchmarks for the new threading model are up, along with a few others: > > http://doom10.org/index.php?topic=78.0 > > Interestingly enough, CFS beats BFS on zerolatency by a significant > margin. Unsurprisingly, given the threading model, the optimal number > of threads is equal to the number of cores. > > Jason > And I am apparently blind: I cannot read graphs. Ignore the conclusion made in the above post ;) Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar @ 2009-12-18 10:56 ` Kasper Sandberg 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:56 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 17:18 -0800, Jason Garrett-Glaser wrote: > On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > >> > >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > >> > > well well :) nothing quite speaks out like graphs.. > >> > > > >> > > http://doom10.org/index.php?topic=78.0 > >> > > > >> > > > >> > > > >> > > regards, > >> > > Kasper Sandberg > >> > > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > >> > it--and given the strict thread-ordering expectations of x264, you basically > >> > can't expect it to do any better, though I'm curious what's responsible for > >> > the gap in "veryslow", even with SCHED_BATCH enabled. > >> > > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Can't individually applications request SCHED_BATCH? Our plan was to > have x264 simply detect if it was necessary (once we figure out what > encoding settings result in the large gap situation) and automatically > enable it for the current application. that is an insane solution, especially considering better schedulers outperform cfs SCHED_BATCH without doing ANYTHING special. Do you not see what is happening here? 
it is simply grotesque. > > Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
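Whatever one thinks of the plan, the self-selection Jason describes upthread — x264 switching itself to SCHED_BATCH — is a small amount of code on Linux. A sketch via Python's binding of `sched_setscheduler(2)` (x264 itself would make the equivalent C call; the `enable_sched_batch` helper name is mine):

```python
import os

def enable_sched_batch():
    """Switch the calling process to SCHED_BATCH, as an application
    could do for itself at startup.  Moving between the default
    SCHED_OTHER and SCHED_BATCH needs no privileges; both use static
    priority 0.  Returns the active policy, or None if the platform
    or kernel refused the change."""
    try:
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    except (AttributeError, OSError):
        return None
    return os.sched_getscheduler(0)

if __name__ == "__main__":
    print("policy:", enable_sched_batch())
```

SCHED_BATCH only hints to the scheduler that the process is CPU-bound and non-interactive, which is why it affects wakeup preemption without changing priorities.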
end of thread, other threads:[~2009-12-22 7:39 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-12-17 9:33 x264 benchmarks BFS vs CFS Kasper Sandberg 2009-12-17 10:42 ` Jason Garrett-Glaser 2009-12-17 10:53 ` Ingo Molnar 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 2009-12-17 13:30 ` Mike Galbraith 2009-12-18 10:54 ` Kasper Sandberg 2009-12-18 11:41 ` Mike Galbraith 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 10:56 ` Kasper Sandberg 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:36 ` Mike Galbraith 2009-12-18 13:53 ` Mike Galbraith 2009-12-18 10:57 ` Kasper Sandberg 2009-12-18 11:05 ` Jason Garrett-Glaser 2009-12-19 1:08 ` Con Kolivas 2009-12-19 4:03 ` Mike Galbraith 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 2009-12-20 15:51 ` Mike Galbraith 2009-12-22 7:33 ` Jason Garrett-Glaser 2009-12-22 7:39 ` Jason Garrett-Glaser 2009-12-18 10:56 ` Kasper Sandberg