* x264 benchmarks BFS vs CFS @ 2009-12-17 9:33 Kasper Sandberg 2009-12-17 10:42 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 9:33 UTC (permalink / raw) To: Ingo Molnar; +Cc: LKML Mailinglist well well :) nothing quite speaks out like graphs.. http://doom10.org/index.php?topic=78.0 regards, Kasper Sandberg ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 9:33 x264 benchmarks BFS vs CFS Kasper Sandberg @ 2009-12-17 10:42 ` Jason Garrett-Glaser 2009-12-17 10:53 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-17 10:42 UTC (permalink / raw) To: Kasper Sandberg; +Cc: Ingo Molnar, LKML Mailinglist On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > well well :) nothing quite speaks out like graphs.. > > http://doom10.org/index.php?topic=78.0 > > > > regards, > Kasper Sandberg Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied it--and given the strict thread-ordering expectations of x264, you basically can't expect it to do any better, though I'm curious what's responsible for the gap in "veryslow", even with SCHED_BATCH enabled. The most odd case is that of "ultrafast", in which CFS immediately ties BFS when we enable SCHED_BATCH. We're doing some further testing to see exactly what the conditions of this are--is it because ultrafast is just so much faster than all the other modes and so switches threads/loads faster? Is it because ultrafast has relatively equal workload among the threads, unlike the other loads? We'll probably know soon. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 10:42 ` Jason Garrett-Glaser @ 2009-12-17 10:53 ` Ingo Molnar 2009-12-17 11:00 ` Kasper Sandberg 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-17 10:53 UTC (permalink / raw) To: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra Cc: Kasper Sandberg, LKML Mailinglist * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > well well :) nothing quite speaks out like graphs.. > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > regards, > > Kasper Sandberg > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > it--and given the strict thread-ordering expectations of x264, you basically > can't expect it to do any better, though I'm curious what's responsible for > the gap in "veryslow", even with SCHED_BATCH enabled. > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > when we enable SCHED_BATCH. We're doing some further testing to see exactly > what the conditions of this are--is it because ultrafast is just so much > faster than all the other modes and so switches threads/loads faster? Is it > because ultrafast has relatively equal workload among the threads, unlike > the other loads? We'll probably know soon. Thanks for testing it! Btw., you might want to make use of 'perf sched record', 'perf sched map', 'perf sched trace' etc. to get an insight into how a particular workload schedules and why those decisions are done. (You'll need CONFIG_SCHED_DEBUG=y for best results.) Thanks, Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 10:53 ` Ingo Molnar @ 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar ` (3 more replies) 0 siblings, 4 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 11:00 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > well well :) nothing quite speaks out like graphs.. > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > regards, > > > Kasper Sandberg > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > it--and given the strict thread-ordering expectations of x264, you basically > > can't expect it to do any better, though I'm curious what's responsible for > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > when we enable SCHED_BATCH. We're doing some further testing to see exactly That's kind of beside the point. All these tunables and weirdness are _NEVER_ going to work for people. Now forgive me for being so blunt, but for a user, having to do echo x264 > /proc/cfs/gief_me_performance_on_app or echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app just isn't usable. BFS matches, even exceeds, CFS on all counts with ZERO user tuning, so while CFS may be able to nearly match up with a ton of application-specific tweaking, that just doesn't work for a normal user. Not to mention that BFS does this while not losing interactivity, something which CFS certainly cannot boast. 
<snip> > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg @ 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 2009-12-17 13:30 ` Mike Galbraith ` (2 subsequent siblings) 3 siblings, 2 replies; 34+ messages in thread From: Ingo Molnar @ 2009-12-17 12:08 UTC (permalink / raw) To: Kasper Sandberg Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Kasper Sandberg <lkml@metanurb.dk> wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > it--and given the strict thread-ordering expectations of x264, you basically > > > can't expect it to do any better, though I'm curious what's responsible for > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. v2.6.32 improved quite a bit on the x264 front so i dont think that's necessarily the case. But yes, i'll subscribe to the view that we cannot satisfy everything all the time. There's tradeoffs in every scheduler design. 
> now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO > user tuning, so while cfs may be able to nearly match up with a ton of > application specific stuff, that just doesnt work for a normal user. > > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. What kind of latencies are those? Arent they just compiz induced due to different weighting of workloads in BFS and in the upstream scheduler? Would you be willing to help us out pinning them down? To move the discussion to the numeric front please send the 'perf sched latency' output of an affected workload. Thanks, Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 12:08 ` Ingo Molnar @ 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-17 12:35 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 13:08 +0100, Ingo Molnar wrote: > * Kasper Sandberg <lkml@metanurb.dk> wrote: > > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > v2.6.32 improved quite a bit on the x264 front so i dont think that's > necessarily the case. again, pretty much application specific, and furthermore, ONLY with SCHED_BATCH is it near BFS. as you know, SCHED_BATCH isnt exactly what you wanna do for desktop or other interactivity-hungry tasks? bfs manages better performance than cfs with SCHED_BATCH, without SCHED_BATCH > > But yes, i'll subscribe to the view that we cannot satisfy everything all the > time. There's tradeoffs in every scheduler design. 
Yet getting CFS even close to BFS's average performance requires tunables, switching scheduler policies, etc. > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO > > user tuning, so while cfs may be able to nearly match up with a ton of > > application specific stuff, that just doesnt work for a normal user. ^^^^ This is also something you need to consider. > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > What kind of latencies are those? Arent they just compiz induced due to > different weighting of workloads in BFS and in the upstream scheduler? > Would you be willing to help us out pinning them down? There's not much I can do; I don't have time to switch kernels on my systems. All I can give you is this simple information: on my systems, ranging from embedded to dual Core 2 Quad and Core i7, BFS manages to give lower latencies (i.e. jack doesn't skip with very low-latency output, and everything is smoother, even measurably, on the desktop) and greater performance (as evidenced by lots of benchmarks, including those I posted), and that is without touching a single scheduler policy or tunable at all. I'm well aware that CFS can be tweaked via tunables/policies to achieve a single one of these goals at a time, and I'm also well aware you cannot ever handle every single corner case perfectly with one scheduler. However, and consider this very thoroughly: BFS manages, without any tunables, to handle the vast majority of cases with an excellence CFS cannot 100% match even with tunables and scheduler policies.. and that is with A LOT less code as well.. This ought to tell you that something can and should be done. 
> > To move the discussion to the numeric front please send the 'perf sched > latency' output of an affected workload. > > Thanks, > > Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg @ 2009-12-17 15:47 ` Arjan van de Ven 1 sibling, 0 replies; 34+ messages in thread From: Arjan van de Ven @ 2009-12-17 15:47 UTC (permalink / raw) To: Ingo Molnar Cc: Kasper Sandberg, Jason Garrett-Glaser, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 17 Dec 2009 13:08:26 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > What kind of latencies are those? Arent they just compiz induced due > to different weighting of workloads in BFS and in the upstream > scheduler? Would you be willing to help us out pinning them down? > > To move the discussion to the numeric front please send the 'perf > sched latency' output of an affected workload. CFS in .32 and before has one known, and now fixed, latency issue. In .32, wake_up() (which is behind most inter-thread communication, and lots of other things) was trying to keep the waker and wakee on the same logical cpu at pretty much all cost. In .33-git, Mike fixed this so that, if there's a free logical cpu sibling, or, on a multicore cpu, another core which shares the cache, the new task is just scheduled on that free cpu rather than on the current, guaranteed-busy, cpu. This change helps latency a lot, and as a result, performance for various latency-sensitive workloads... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org ^ permalink raw reply [flat|nested] 34+ messages in thread
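[The wakeup-placement idea Arjan describes can be sketched roughly as follows. This is purely illustrative pseudocode, not the actual kernel code; every name in it is made up for illustration (the real logic lives in the fair-scheduler wakeup path, kernel/sched_fair.c, in .33-git):]

```
/* Illustrative pseudocode only -- all helper names are invented. */
on_wake_up(task wakee, cpu waker_cpu):
    /* .33-git behavior: prefer a free cpu that shares cache with the waker
     * (an SMT sibling, or another core on the same multicore package). */
    for each cpu c in cache_sharing_domain(waker_cpu):
        if cpu_is_idle(c):
            return c          /* wakee starts running immediately */
    /* .32-and-before behavior: stay with the waker's cpu at all cost,
     * even though it is guaranteed busy (the waker is running on it). */
    return waker_cpu
```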
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar @ 2009-12-17 13:30 ` Mike Galbraith 2009-12-18 10:54 ` Kasper Sandberg 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 1:18 ` Jason Garrett-Glaser 3 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-17 13:30 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > it--and given the strict thread-ordering expectations of x264, you basically > > > can't expect it to do any better, though I'm curious what's responsible for > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. Fact is, it is working for a great number of people, the vast majority of whom don't even know where the knobs are, much less what they do. > now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app Theatrics noted. 
> just isnt usable, bfs matches, even exceeds cfs on all accounts, with > ZERO user tuning, so while cfs may be able to nearly match up with a ton > of application specific stuff, that just doesnt work for a normal user. Seems you haven't done much benchmarking. BFS has strengths as well as weaknesses, all schedulers do. > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. Not true. I sent Con hard evidence of a severe problem area wrt interactivity, and hard numbers showing other places where BFS needs some work. But hey, if BFS blows your skirt up, use it and be happy. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 13:30 ` Mike Galbraith @ 2009-12-18 10:54 ` Kasper Sandberg 2009-12-18 11:41 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:54 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote: > On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Fact is, it is working for a great number of people, the vast majority > of whom don't even know where the knobs are, much less what they do. but not as great as it could be :) > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > Theatrics noted. 
> > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > of application specific stuff, that just doesnt work for a normal user. > > Seems you haven't done much benchmarking. BFS has strengths as well as > weaknesses, all schedulers do. yeah, BFS just has more strengths and fewer weaknesses than CFS :) > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. > > Not true. I sent Con hard evidence of a severe problem area wrt > interactivity, and hard numbers showing other places where BFS needs > some work. But hey, if BFS blows your skirt up, use it and be happy. Theatrics noted. As for your point, well.. as far as I have heard, all you've come up with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going to do, and thus irrelevant > > -Mike > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:54 ` Kasper Sandberg @ 2009-12-18 11:41 ` Mike Galbraith 0 siblings, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 11:41 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 11:54 +0100, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote: > > On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote: > > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > > > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > > > > > regards, > > > > > > Kasper Sandberg > > > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > > > > > it--and given the strict thread-ordering expectations of x264, you basically > > > > > can't expect it to do any better, though I'm curious what's responsible for > > > > > the gap in "veryslow", even with SCHED_BATCH enabled. > > > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > > > Thats kinda besides the point. > > > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > > > Fact is, it is working for a great number of people, the vast majority > > of whom don't even know where the knobs are, much less what they do. > but not as great as it could be :) > > > > > > now forgive me for being so blunt, but for a user, having to do > > > echo x264 > /proc/cfs/gief_me_performance_on_app > > > or > > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > Theatrics noted. 
> > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > > of application specific stuff, that just doesnt work for a normal user. > > > > Seems you haven't done much benchmarking. BFS has strengths as well as > > weaknesses, all schedulers do. > yeah, BFS just has more strengths and fewer weaknesses than CFS :) > > > > > not to mention that bfs does this whilst not loosing interactivity, > > > something which cfs certainly cannot boast. > > > > Not true. I sent Con hard evidence of a severe problem area wrt > > interactivity, and hard numbers showing other places where BFS needs > > some work. But hey, if BFS blows your skirt up, use it and be happy. > Theatrics noted. > > As for your point, well.. as far as i have heard, all you've come up > with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going > to do, and thus irellevant Goodbye troll. *PLONK* ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 13:30 ` Mike Galbraith @ 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 10:56 ` Kasper Sandberg 2009-12-18 1:18 ` Jason Garrett-Glaser 3 siblings, 1 reply; 34+ messages in thread From: Thomas Fjellstrom @ 2009-12-17 21:22 UTC (permalink / raw) To: linux-kernel On Thu December 17 2009, Kasper Sandberg wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > regards, > > > > Kasper Sandberg > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically > > > tied it--and given the strict thread-ordering expectations of x264, > > > you basically can't expect it to do any better, though I'm curious > > > what's responsible for the gap in "veryslow", even with SCHED_BATCH > > > enabled. > > > > > > The most odd case is that of "ultrafast", in which CFS immediately > > > ties BFS when we enable SCHED_BATCH. We're doing some further > > > testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. > > now forgive me for being so blunt, but for a user, having to do > echo x264 > /proc/cfs/gief_me_performance_on_app > or > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > ZERO user tuning, so while cfs may be able to nearly match up with a ton > of application specific stuff, that just doesnt work for a normal user. > > not to mention that bfs does this whilst not loosing interactivity, > something which cfs certainly cannot boast. 
> > <snip> Strange, I seem to recall that BFS needs you to run apps with some silly schedtool program to get media apps to not skip while doing other tasks. (I don't have to tweak CFS at all) > > Thanks, > > > > Ingo -- Thomas Fjellstrom tfjellstrom@shaw.ca ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 21:22 ` Thomas Fjellstrom @ 2009-12-18 10:56 ` Kasper Sandberg 0 siblings, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:56 UTC (permalink / raw) To: tfjellstrom; +Cc: linux-kernel On Thu, 2009-12-17 at 14:22 -0700, Thomas Fjellstrom wrote: > On Thu December 17 2009, Kasper Sandberg wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > > > * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> > wrote: > > > > > well well :) nothing quite speaks out like graphs.. > > > > > > > > > > http://doom10.org/index.php?topic=78.0 > > > > > > > > > > > > > > > > > > > > regards, > > > > > Kasper Sandberg > > > > > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically > > > > tied it--and given the strict thread-ordering expectations of x264, > > > > you basically can't expect it to do any better, though I'm curious > > > > what's responsible for the gap in "veryslow", even with SCHED_BATCH > > > > enabled. > > > > > > > > The most odd case is that of "ultrafast", in which CFS immediately > > > > ties BFS when we enable SCHED_BATCH. We're doing some further > > > > testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > > > now forgive me for being so blunt, but for a user, having to do > > echo x264 > /proc/cfs/gief_me_performance_on_app > > or > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app > > > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with > > ZERO user tuning, so while cfs may be able to nearly match up with a ton > > of application specific stuff, that just doesnt work for a normal user. > > > > not to mention that bfs does this whilst not loosing interactivity, > > something which cfs certainly cannot boast. 
> > > > <snip> > > Strange, I seem to recall that BFS needs you to run apps with some silly > schedtool program to get media apps to not skip while doing other tasks. (I > don't have to tweak CFS at all) You recall incorrectly > > > > Thanks, > > > > > > Ingo > > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-17 11:00 ` Kasper Sandberg ` (2 preceding siblings ...) 2009-12-17 21:22 ` Thomas Fjellstrom @ 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 10:56 ` Kasper Sandberg 3 siblings, 2 replies; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 1:18 UTC (permalink / raw) To: Kasper Sandberg Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: >> >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: >> > > well well :) nothing quite speaks out like graphs.. >> > > >> > > http://doom10.org/index.php?topic=78.0 >> > > >> > > >> > > >> > > regards, >> > > Kasper Sandberg >> > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied >> > it--and given the strict thread-ordering expectations of x264, you basically >> > can't expect it to do any better, though I'm curious what's responsible for >> > the gap in "veryslow", even with SCHED_BATCH enabled. >> > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > Thats kinda besides the point. > > all these tunables and weirdness is _NEVER_ going to work for people. Can't individual applications request SCHED_BATCH? Our plan was to have x264 simply detect whether it's necessary (once we figure out which encoding settings result in the large-gap situation) and automatically enable it for the current application. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 1:18 ` Jason Garrett-Glaser @ 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:56 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-18 5:23 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Kasper Sandberg, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > >> > >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > >> > > well well :) nothing quite speaks out like graphs.. > >> > > > >> > > http://doom10.org/index.php?topic=78.0 > >> > > > >> > > > >> > > > >> > > regards, > >> > > Kasper Sandberg > >> > > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > >> > it--and given the strict thread-ordering expectations of x264, you basically > >> > can't expect it to do any better, though I'm curious what's responsible for > >> > the gap in "veryslow", even with SCHED_BATCH enabled. > >> > > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Can't individually applications request SCHED_BATCH? Our plan was to have > x264 simply detect if it was necessary (once we figure out what encoding > settings result in the large gap situation) and automatically enable it for > the current application. Yeah, SCHED_BATCH can be requested at will by an app. It's an unprivileged operation. It gets passed down to child tasks. 
(You can just do it unconditionally - older kernels will ignore it and give you an error code from the sched_setscheduler() call.) Having said that, we generally try to make things perform well without apps having to switch themselves to SCHED_BATCH. Mike, do you think we can make x264 perform as well (or nearly as well) under SCHED_OTHER as under SCHED_BATCH? Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 5:23 ` Ingo Molnar @ 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 10:57 ` Kasper Sandberg 0 siblings, 2 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 7:30 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > Having said that, we generally try to make things perform well without apps > having to switch themselves to SCHED_BATCH. Mike, do you think we can make > x264 perform as well (or nearly as well) under SCHED_OTHER as under > SCHED_BATCH? It's not bad as is, except for ultrafast mode. START_DEBIT is the biggest problem there. I don't think SCHED_OTHER will ever match SCHED_BATCH for this load, though I must say I haven't full-spectrum tested. This load really wants RR scheduling, and wakeup preemption necessarily perturbs run order. I'll probably piddle with it some more, it's an interesting load. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 7:30 ` Mike Galbraith @ 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 10:57 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 10:11 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 3234 bytes --] On Thu, Dec 17, 2009 at 11:30 PM, Mike Galbraith <efault@gmx.de> wrote: > On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > >> Having said that, we generally try to make things perform well without apps >> having to switch themselves to SCHED_BATCH. Mike, do you think we can make >> x264 perform as well (or nearly as well) under SCHED_OTHER as under >> SCHED_BATCH? > > It's not bad as is, except for ultrafast mode. START_DEBIT is the > biggest problem there. I don't think SCHED_OTHER will ever match > SCHED_BATCH for this load, though I must say I haven't full-spectrum > tested. This load really wants RR scheduling, and wakeup preemption > necessarily perturbs run order. > > I'll probably piddle with it some more, it's an interesting load. > > -Mike > > Two more thoughts here: 1) We're considering moving to a thread pool soon; we already have a working patch for it and if anything it'll save a few clocks spent on nice()ing threads and other such things. Will this improve START_DEBIT at all? I've attached the beta patch if you want to try it. Note this also works with 2) as well, so it adds yet another dimension to what's mentioned below. 2) We recently implemented a new threading model which may be interesting to test as well. This threading model gives worse compression *and* performance, but has one benefit: it adds zero latency, whereas normal threading adds a full frame of latency per thread. This was paid for by a company interested in ultra-low-latency streaming applications, where 1 millisecond is a huge deal. 
I've been thinking this might be interesting to bench from a kernel perspective as well, as when you're spawning a half-dozen threads and need them all done within 6 milliseconds, you start getting down to serious scheduler issues. The new threading model is much less complex than the regular one and works as follows. The frame is split into X slices, and each slice encoded with one thread. Specifically, it works via the following process: 1. Preprocess input frame, perform lookahead analysis on input frame (all singlethreaded) 2. Split up a ton of threads to do the main encode, one per slice. 3. Join all the threads. 4. Do post-filtering on the output frame, return. Clearly this is an utter disaster, since it spawns N times as many threads as the old threading model *and* they last far shorter, *and* only part of the application is multithreaded. But there's not really a better way to do low-latency threading, and it's an interesting challenge to boot. IIRC, it's also the way ffmpeg's encoder threading works. It's widely considered an inferior model, but as mentioned before, in this particular use-case there's no choice. To enable this, use --sliced-threads. I'd recommend using a higher-resolution clip for this, as it performs atrociously bad on very low resolution videos for reasons you might be able to guess. If you need a higher-res clip, check the SD or HD ones here: http://media.xiph.org/video/derf/ . I'm personally curious as to what kind of scheduler issues this results in--I haven't done any BFS vs CFS tests with this option enabled yet. 
Jason [-- Attachment #2: thread_pool_slices.diff --] [-- Type: application/octet-stream, Size: 13295 bytes --] diff --git a/common/common.h b/common/common.h index 417ac9e..28d6c1d 100644 --- a/common/common.h +++ b/common/common.h @@ -337,12 +337,20 @@ struct x264_t /* encoder parameters */ x264_param_t param; - x264_t *thread[X264_THREAD_MAX+1]; - x264_pthread_t thread_handle; - int b_thread_active; - int i_thread_phase; /* which thread to use for the next frame */ - int i_threadslice_start; /* first row in this thread slice */ - int i_threadslice_end; /* row after the end of this thread slice */ + x264_t *thread[X264_THREAD_MAX+1]; /* contexts for each frame in progress + lookahead */ + x264_pthread_t *thread_handle; + x264_pthread_cond_t thread_queue_cv; + x264_pthread_mutex_t thread_queue_mutex; + x264_t **thread_queue; /* frames that have been prepared but not yet claimed by a worker thread */ + x264_pthread_cond_t thread_active_cv; + x264_pthread_mutex_t thread_active_mutex; + int thread_active; + int b_thread_active; + int i_thread_phase; /* which thread to use for the next frame */ + int thread_exit; + int thread_error; + int i_threadslice_start; /* first row in this thread slice */ + int i_threadslice_end; /* row after the end of this thread slice */ /* bitstream output */ struct diff --git a/encoder/encoder.c b/encoder/encoder.c index 0c0010f..bc0e75b 100644 --- a/encoder/encoder.c +++ b/encoder/encoder.c @@ -47,6 +47,53 @@ static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current, x264_nal_t **pp_nal, int *pi_nal, x264_picture_t *pic_out ); +/* threading */ + +static void *x264_slices_write_thread( x264_t *h ); + +#ifdef HAVE_PTHREAD +static void x264_int_cond_broadcast( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{ + x264_pthread_mutex_lock( mutex ); + *var = val; + x264_pthread_cond_broadcast( cv ); + x264_pthread_mutex_unlock( mutex ); +} + +static void x264_int_cond_wait( x264_pthread_cond_t *cv, 
x264_pthread_mutex_t *mutex, int *var, int val ) +{ + x264_pthread_mutex_lock( mutex ); + while( *var != val ) + x264_pthread_cond_wait( cv, mutex ); + x264_pthread_mutex_unlock( mutex ); +} + +#else +static void x264_int_cond_broadcast( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{} +static void x264_int_cond_wait( x264_pthread_cond_t *cv, x264_pthread_mutex_t *mutex, int *var, int val ) +{} +#endif + +static void x264_thread_pool_push( x264_t *h ) +{ + assert( h->thread_active == 0 ); + h->thread_active = 1; + assert( h->b_thread_active == 0 ); + h->b_thread_active = 1; + x264_pthread_mutex_lock( &h->thread[0]->thread_queue_mutex ); + x264_frame_push( (void*)h->thread_queue, (void*)h ); + x264_pthread_cond_broadcast( &h->thread[0]->thread_queue_cv ); + x264_pthread_mutex_unlock( &h->thread[0]->thread_queue_mutex ); +} + +static int x264_thread_pool_wait( x264_t *h ) +{ + x264_int_cond_wait( &h->thread_active_cv, &h->thread_active_mutex, &h->thread_active, 0 ); + h->b_thread_active = 0; + return h->thread_error; +} + /**************************************************************************** * ******************************* x264 libs ********************************** @@ -943,6 +990,16 @@ x264_t *x264_encoder_open( x264_param_t *param ) for( i = 1; i < h->param.i_threads + !!h->param.i_sync_lookahead; i++ ) CHECKED_MALLOC( h->thread[i], sizeof(x264_t) ); + if( h->param.i_threads > 1 ) + { + CHECKED_MALLOCZERO( h->thread_handle, (h->param.i_threads + 1) * sizeof(x264_pthread_t) ); + CHECKED_MALLOCZERO( h->thread_queue, (h->param.i_threads + 1) * sizeof(x264_t*) ); + if( x264_pthread_cond_init( &h->thread_queue_cv, NULL ) ) + goto fail; + if( x264_pthread_mutex_init( &h->thread_queue_mutex, NULL ) ) + goto fail; + } + if( x264_lookahead_init( h, i_slicetype_length ) ) goto fail; @@ -967,6 +1024,14 @@ x264_t *x264_encoder_open( x264_param_t *param ) CHECKED_MALLOC( h->thread[i]->out.nal, init_nal_count*sizeof(x264_nal_t) ); 
h->thread[i]->out.i_nals_allocated = init_nal_count; + if( h->param.i_threads > 1 ) + { + if( x264_pthread_cond_init( &h->thread[i]->thread_active_cv, NULL ) ) + goto fail; + if( x264_pthread_mutex_init( &h->thread[i]->thread_active_mutex, NULL ) ) + goto fail; + } + if( allocate_threadlocal_data && x264_macroblock_cache_init( h->thread[i] ) < 0 ) goto fail; } @@ -1009,6 +1074,13 @@ x264_t *x264_encoder_open( x264_param_t *param ) h->sps->i_profile_idc == PROFILE_HIGH ? "High" : "High 4:4:4 Predictive", h->sps->i_level_idc/10, h->sps->i_level_idc%10 ); + if( h->param.i_threads > 1 ) + { + for( i = 0; i < h->param.i_threads; i++ ) + if( x264_pthread_create( &h->thread_handle[i], NULL, (void*)x264_slices_write_thread, h ) ) + return NULL; + } + return h; fail: x264_free( h ); @@ -1723,7 +1795,7 @@ static int x264_slice_write( x264_t *h ) h->mb.b_reencode_mb = 0; #if VISUALIZE - if( h->param.b_visualize ) + if( h->i_threads == 1 && h->param.b_visualize ) x264_visualize_mb( h ); #endif @@ -1851,24 +1923,10 @@ static void x264_thread_sync_stat( x264_t *dst, x264_t *src ) memcpy( &dst->stat.i_frame_count, &src->stat.i_frame_count, sizeof(dst->stat) - sizeof(dst->stat.frame) ); } -static void *x264_slices_write( x264_t *h ) +static int x264_slices_write_internal( x264_t *h ) { int i_slice_num = 0; int last_thread_mb = h->sh.i_last_mb; - if( h->param.i_sync_lookahead ) - x264_lower_thread_priority( 10 ); - -#ifdef HAVE_MMX - /* Misalign mask has to be set separately for each thread. 
*/ - if( h->param.cpu&X264_CPU_SSE_MISALIGN ) - x264_cpu_mask_misalign_sse(); -#endif - -#if VISUALIZE - if( h->param.b_visualize ) - if( x264_visualize_init( h ) ) - return (void *)-1; -#endif /* init stats */ memset( &h->stat.frame, 0, sizeof(h->stat.frame) ); @@ -1887,10 +1945,30 @@ static void *x264_slices_write( x264_t *h ) } h->sh.i_last_mb = X264_MIN( h->sh.i_last_mb, last_thread_mb ); if( x264_stack_align( x264_slice_write, h ) ) - return (void *)-1; + return -1; h->sh.i_first_mb = h->sh.i_last_mb + 1; } + return 0; +} + +static int x264_slices_write( x264_t *h ) +{ +#ifdef HAVE_MMX + /* Misalign mask has to be set separately for each thread. */ + if( h->param.cpu&X264_CPU_SSE_MISALIGN ) + x264_cpu_mask_misalign_sse(); +#endif + +#if VISUALIZE + if( h->param.b_visualize ) + if( x264_visualize_init( h ) ) + return -1; +#endif + + if( x264_slices_write_internal( h ) ) + return -1; + #if VISUALIZE if( h->param.b_visualize ) { @@ -1899,13 +1977,47 @@ static void *x264_slices_write( x264_t *h ) } #endif + return 0; +} + +static void *x264_slices_write_thread( x264_t *h ) +{ + if( h->param.i_sync_lookahead ) + x264_lower_thread_priority( 10 ); + +#ifdef HAVE_MMX + /* Misalign mask has to be set separately for each thread. 
*/ + if( h->param.cpu&X264_CPU_SSE_MISALIGN ) + x264_cpu_mask_misalign_sse(); +#endif + + for(;;) + { + x264_t *t = NULL; + + // get one frame from the queue + x264_pthread_mutex_lock( &h->thread_queue_mutex ); + while( !h->thread_queue[0] && !h->thread_exit ) + x264_pthread_cond_wait( &h->thread_queue_cv, &h->thread_queue_mutex ); + if( h->thread_queue[0] ) + t = (void*)x264_frame_shift( (void*)h->thread_queue ); + x264_pthread_mutex_unlock( &h->thread_queue_mutex ); + if( h->thread_exit ) + return (void *)0; + if( !t ) + continue; + + t->thread_error = x264_slices_write_internal( t ); + + x264_int_cond_broadcast( &t->thread_active_cv, &t->thread_active_mutex, &t->thread_active, 0 ); + } + return (void *)0; } static int x264_threaded_slices_write( x264_t *h ) { int i, j; - void *ret = NULL; /* set first/last mb and sync contexts */ for( i = 0; i < h->param.i_threads; i++ ) { @@ -1928,14 +2040,10 @@ static int x264_threaded_slices_write( x264_t *h ) /* dispatch */ for( i = 0; i < h->param.i_threads; i++ ) - if( x264_pthread_create( &h->thread[i]->thread_handle, NULL, (void*)x264_slices_write, (void*)h->thread[i] ) ) - return -1; + x264_thread_pool_push( h->thread[i] ); for( i = 0; i < h->param.i_threads; i++ ) - { - x264_pthread_join( h->thread[i]->thread_handle, &ret ); - if( (intptr_t)ret ) - return (intptr_t)ret; - } + if( x264_thread_pool_wait( h->thread[i] ) ) + return -1; /* deblocking and hpel filtering */ for( i = 0; i <= h->sps->i_mb_height; i++ ) @@ -2238,18 +2346,14 @@ int x264_encoder_encode( x264_t *h, h->i_threadslice_start = 0; h->i_threadslice_end = h->sps->i_mb_height; if( !h->param.b_sliced_threads && h->param.i_threads > 1 ) - { - if( x264_pthread_create( &h->thread_handle, NULL, (void*)x264_slices_write, h ) ) - return -1; - h->b_thread_active = 1; - } + x264_thread_pool_push( h ); else if( h->param.b_sliced_threads ) { if( x264_threaded_slices_write( h ) ) return -1; } else - if( (intptr_t)x264_slices_write( h ) ) + if( x264_slices_write( h ) ) 
return -1; return x264_encoder_frame_end( thread_oldest, thread_current, pp_nal, pi_nal, pic_out ); @@ -2263,13 +2367,8 @@ static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current, char psz_message[80]; if( h->b_thread_active ) - { - void *ret = NULL; - x264_pthread_join( h->thread_handle, &ret ); - if( (intptr_t)ret ) - return (intptr_t)ret; - h->b_thread_active = 0; - } + if( x264_thread_pool_wait( h ) ) + return -1; if( !h->out.i_nal ) { pic_out->i_type = X264_TYPE_AUTO; @@ -2472,15 +2571,29 @@ void x264_encoder_close ( x264_t *h ) x264_lookahead_delete( h ); - for( i = 0; i < h->param.i_threads; i++ ) + if( h->param.i_threads > 1 ) { // don't strictly have to wait for the other threads, but it's simpler than canceling them - if( h->thread[i]->b_thread_active ) + x264_pthread_mutex_lock( &h->thread_queue_mutex ); + h->thread_exit = 1; + x264_pthread_cond_broadcast( &h->thread_queue_cv ); + x264_pthread_mutex_unlock( &h->thread_queue_mutex ); + for( i = 0; i < h->param.i_threads; i++ ) + x264_pthread_join( h->thread_handle[i], NULL ); + for( i = 0; i < h->param.i_threads; i++ ) { - x264_pthread_join( h->thread[i]->thread_handle, NULL ); - assert( h->thread[i]->fenc->i_reference_count == 1 ); - x264_frame_delete( h->thread[i]->fenc ); + x264_pthread_cond_destroy( &h->thread[i]->thread_active_cv ); + x264_pthread_mutex_destroy( &h->thread[i]->thread_active_mutex ); + if( h->thread[i]->b_thread_active ) + { + assert( h->thread[i]->fenc->i_reference_count == 1 ); + x264_frame_delete( h->thread[i]->fenc ); + } } + x264_pthread_cond_destroy( &h->thread_queue_cv ); + x264_pthread_mutex_destroy( &h->thread_queue_mutex ); + x264_free( h->thread_handle ); + x264_free( h->thread_queue ); } if( h->param.i_threads > 1 && !h->param.b_sliced_threads ) diff --git a/encoder/lookahead.c b/encoder/lookahead.c index f33b167..039b9cb 100644 --- a/encoder/lookahead.c +++ b/encoder/lookahead.c @@ -152,7 +152,7 @@ int x264_lookahead_init( x264_t *h, int i_slicetype_length ) 
if( x264_macroblock_cache_init( look_h ) ) goto fail; - if( x264_pthread_create( &look_h->thread_handle, NULL, (void *)x264_lookahead_thread, look_h ) ) + if( x264_pthread_create( &h->thread_handle[h->param.i_threads], NULL, (void *)x264_lookahead_thread, look_h ) ) goto fail; look->b_thread_active = 1; @@ -170,7 +170,7 @@ void x264_lookahead_delete( x264_t *h ) h->lookahead->b_exit_thread = 1; x264_pthread_cond_broadcast( &h->lookahead->ifbuf.cv_fill ); x264_pthread_mutex_unlock( &h->lookahead->ifbuf.mutex ); - x264_pthread_join( h->thread[h->param.i_threads]->thread_handle, NULL ); + x264_pthread_join( h->thread_handle[h->param.i_threads], NULL ); x264_macroblock_cache_end( h->thread[h->param.i_threads] ); x264_free( h->thread[h->param.i_threads]->scratch_buffer ); x264_free( h->thread[h->param.i_threads] ); ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:11 ` Jason Garrett-Glaser @ 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:53 ` Mike Galbraith 0 siblings, 2 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 12:49 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 02:11 -0800, Jason Garrett-Glaser wrote: > Two more thoughts here: > > 1) We're considering moving to a thread pool soon; we already have a > working patch for it and if anything it'll save a few clocks spent on > nice()ing threads and other such things. Will this improve > START_DEBIT at all? Yeah, START_DEBIT only affects a thread once. > I've attached the beta patch if you want to try > it. Note this also works with 2) as well, so it adds yet another > dimension to what's mentioned below. > > 2) We recently implemented a new threading model which may be > interesting to test as well. This threading model gives worse > compression *and* performance, but has one benefit: it adds zero > latency, whereas normal threading adds a full frame of latency per > thread. This was paid for by a company interested in > ultra-low-latency streaming applications, where 1 millisecond is a > huge deal. I've been thinking this might be interesting to bench from > a kernel perspective as well, as when you're spawning a half-dozen > threads and need them all done within 6 milliseconds, you start > getting down to serious scheduler issues. > > The new threading model is much less complex than the regular one and > works as follows. The frame is split into X slices, and each slice > encoded with one thread. Specifically, it works via the following > process: > > 1. Preprocess input frame, perform lookahead analysis on input frame > (all singlethreaded) > 2. Split up a ton of threads to do the main encode, one per slice. > 3. Join all the threads. > 4. 
Do post-filtering on the output frame, return. > > Clearly this is an utter disaster, since it spawns N times as many > threads as the old threading model *and* they last far shorter, *and* > only part of the application is multithreaded. But there's not really > a better way to do low-latency threading, and it's an interesting > challenge to boot. IIRC, it's also the way ffmpeg's encoder threading > works. It's widely considered an inferior model, but as mentioned > before, in this particular use-case there's no choice. > > To enable this, use --sliced-threads. I'd recommend using a > higher-resolution clip for this, as it performs atrociously bad on > very low resolution videos for reasons you might be able to guess. If > you need a higher-res clip, check the SD or HD ones here: > http://media.xiph.org/video/derf/ . In another 8 hrs 24 min, I'll have a sunflower to stare at. > I'm personally curious as to what kind of scheduler issues this > results in--I haven't done any BFS vs CFS tests with this option > enabled yet. I'll look for x264 source, and patch/piddle. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 12:49 ` Mike Galbraith @ 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:36 ` Mike Galbraith 2009-12-18 13:53 ` Mike Galbraith 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2009-12-18 13:06 UTC (permalink / raw) To: Mike Galbraith Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds * Mike Galbraith <efault@gmx.de> wrote: > > I'm personally curious as to what kind of scheduler issues this results > > in--I haven't done any BFS vs CFS tests with this option enabled yet. > > I'll look for x264 source, and patch/piddle. btw., would be nice to look at it via tools/perf/ as well: perf stat --repeat 3 ... to see the basic hardware utilization (cycles/cache-misses, branch execution rate, instructions, etc.) and the basic parallelism metrics, at a glance. i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give us an even more detailed picture. Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 13:06 ` Ingo Molnar @ 2009-12-18 13:36 ` Mike Galbraith 0 siblings, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 13:36 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Garrett-Glaser, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 14:06 +0100, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > > > I'm personally curious as to what kind of scheduler issues this results > > > in--I haven't done any BFS vs CFS tests with this option enabled yet. > > > > I'll look for x264 source, and patch/piddle. > > btw., would be nice to look at it via tools/perf/ as well: > > perf stat --repeat 3 ... > > to see the basic hardware utilization (cycles/cache-misses, branch execution > rate, instructions, etc.) and the basic parallelism metrics, at a glance. > > i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give > us an even more detailed picture. Almost virgin v2.6.32-10468-g020307d running 'medium'. 
encoded 600 frames, 36.52 fps, 13003.54 kb/s

 Performance counter stats for './x264.sh 8' (3 runs):

   63742.218844  task-clock-msecs      #      3.870 CPUs    ( +-   0.016% )
          42593  context-switches      #      0.001 M/sec   ( +-   0.487% )
           3011  CPU-migrations        #      0.000 M/sec   ( +-   0.417% )
          12862  page-faults           #      0.000 M/sec   ( +-   0.004% )
   151734450892  cycles                #   2380.439 M/sec   ( +-   1.947% )  (scaled from 71.44%)
   205642315207  instructions          #      1.355 IPC     ( +-   0.085% )  (scaled from 80.68%)
    16274905932  branches              #    255.324 M/sec   ( +-   0.080% )  (scaled from 80.67%)
     1257135617  branch-misses         #      7.724 %       ( +-   0.255% )  (scaled from 80.06%)
     3116653323  cache-references      #     48.895 M/sec   ( +-   0.340% )  (scaled from 23.78%)
       50823973  cache-misses          #      0.797 M/sec   ( +-   1.400% )  (scaled from 23.76%)

   16.470164901  seconds time elapsed   ( +-   0.079% )

encoded 600 frames, 36.58 fps, 13003.54 kb/s

 Performance counter stats for './x264.sh 8' (3 runs):

   133692266953  L1-icache-loads         ( +-   0.027% )
      997371592  L1-icache-load-misses   ( +-   0.009% )

   16.407060367  seconds time elapsed   ( +-   0.036% )

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar @ 2009-12-18 13:53 ` Mike Galbraith 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-18 13:53 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Kasper Sandberg, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 13:49 +0100, Mike Galbraith wrote: > I'll look for x264 source, and patch/piddle. encoder/encoder.c: In function ‘x264_slice_write’: encoder/encoder.c:1813: error: ‘x264_t’ has no member named ‘i_threads’ make: *** [encoder/encoder.o] Error 1 marge:..src/x264 # git remote -v origin git://git.videolan.org/x264.git (fetch) origin git://git.videolan.org/x264.git (push) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser @ 2009-12-18 10:57 ` Kasper Sandberg 2009-12-18 11:05 ` Jason Garrett-Glaser 1 sibling, 1 reply; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:57 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Jason Garrett-Glaser, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > > Having said that, we generally try to make things perform well without apps > > having to switch themselves to SCHED_BATCH. Mike, do you think we can make > > x264 perform as well (or nearly as well) under SCHED_OTHER as under > > SCHED_BATCH? > > It's not bad as is, except for ultrafast mode. START_DEBIT is the > biggest problem there. I don't think SCHED_OTHER will ever match > SCHED_BATCH for this load, though I must say I haven't full-spectrum > tested. This load really wants RR scheduling, and wakeup preemption > necessarily perturbs run order. > > I'll probably piddle with it some more, it's an interesting load. Yes, i must say, very interresting, its very complicated and... oh wait, its just encoding a movie! > > -Mike > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 10:57 ` Kasper Sandberg @ 2009-12-18 11:05 ` Jason Garrett-Glaser 2009-12-19 1:08 ` Con Kolivas 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-18 11:05 UTC (permalink / raw) To: Kasper Sandberg Cc: Mike Galbraith, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: >> >> > Having said that, we generally try to make things perform well without apps >> > having to switch themselves to SCHED_BATCH. Mike, do you think we can make >> > x264 perform as well (or nearly as well) under SCHED_OTHER as under >> > SCHED_BATCH? >> >> It's not bad as is, except for ultrafast mode. START_DEBIT is the >> biggest problem there. I don't think SCHED_OTHER will ever match >> SCHED_BATCH for this load, though I must say I haven't full-spectrum >> tested. This load really wants RR scheduling, and wakeup preemption >> necessarily perturbs run order. >> >> I'll probably piddle with it some more, it's an interesting load. > Yes, i must say, very interresting, its very complicated and... oh wait, > its just encoding a movie! Your trolling is becoming a bit over-the-top at this point. You should also considering replying to multiple people in one email as opposed to spamming a whole bunch in sequence. Perhaps as the lead x264 developer I'm qualified to say that it certainly is a very complicated load due to the strict ordering requirements of the threading model--and that you should tone down the whining just a tad and perhaps read a bit more about how BFS and CFS work before complaining about them. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 11:05 ` Jason Garrett-Glaser @ 2009-12-19 1:08 ` Con Kolivas 2009-12-19 4:03 ` Mike Galbraith 0 siblings, 1 reply; 34+ messages in thread From: Con Kolivas @ 2009-12-19 1:08 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Kasper Sandberg, Mike Galbraith, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > >> > Having said that, we generally try to make things perform well without > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > >> > under SCHED_BATCH? > >> > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > >> biggest problem there. I don't think SCHED_OTHER will ever match > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > >> tested. This load really wants RR scheduling, and wakeup preemption > >> necessarily perturbs run order. > >> > >> I'll probably piddle with it some more, it's an interesting load. > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > its just encoding a movie! > > Your trolling is becoming a bit over-the-top at this point. You > should also considering replying to multiple people in one email as > opposed to spamming a whole bunch in sequence. > > Perhaps as the lead x264 developer I'm qualified to say that it > certainly is a very complicated load due to the strict ordering > requirements of the threading model--and that you should tone down the > whining just a tad and perhaps read a bit more about how BFS and CFS > work before complaining about them. 
Your workload is interesting because it is a well written real world application with a solid threading model written in a cross platform portable way. Your code is valuable as a measure for precisely this reason, and there's a trap in trying to program in a way that "the scheduler might like". That's presumably what Kasper is trying to point out, albeit in a much blunter fashion. The only workloads I'm remotely interested in are real world workloads involving real applications like yours, software compilation, video playback, audio playback, gaming, apache page serving, mysql performance and so on that people in the real world use on real hardware all day every day. These are, of course, measurable even above and beyond the elusive and impossible to measure and quantify interactivity and responsiveness. I couldn't care less about some artificial benchmark involving LTP, timing mplayer playing in the presence of 100,000 pipes, volanomark which is just a sched_yield benchmark, dbench and hackbench which even their original programmers don't like them being used as a meaningful measure, and so on, and normal users should also not care about the values returned by these artificial benchmarks when they bear no resemblance to their real world performance cases as above. I have zero interest in adding any "tweaks" to BFS to perform well in X benchmark, for there be a path where dragons lie. I've always maintained that, and still stick to it, that the more tweaks you add for corner cases, the more corner cases you introduce yourself. BFS will remain for a targeted audience and I care not to appeal to any artificial benchmarketing obsessed population that drives mainline, since I don't -have- to. Mainline can do what it wants, and hopefully uses BFS as a yardstick for comparison when appropriate. Regards, -- -ck ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 1:08 ` Con Kolivas @ 2009-12-19 4:03 ` Mike Galbraith 2009-12-19 17:36 ` Kasper Sandberg 0 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-19 4:03 UTC (permalink / raw) To: Con Kolivas Cc: Jason Garrett-Glaser, Kasper Sandberg, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > >> > Having said that, we generally try to make things perform well without > > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > > >> > under SCHED_BATCH? > > >> > > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > > >> biggest problem there. I don't think SCHED_OTHER will ever match > > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > > >> tested. This load really wants RR scheduling, and wakeup preemption > > >> necessarily perturbs run order. > > >> > > >> I'll probably piddle with it some more, it's an interesting load. > > > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > > its just encoding a movie! > > > > Your trolling is becoming a bit over-the-top at this point. You > > should also considering replying to multiple people in one email as > > opposed to spamming a whole bunch in sequence. 
> > > > Perhaps as the lead x264 developer I'm qualified to say that it > > certainly is a very complicated load due to the strict ordering > > requirements of the threading model--and that you should tone down the > > whining just a tad and perhaps read a bit more about how BFS and CFS > > work before complaining about them. > > Your workload is interesting because it is a well written real world > application with a solid threading model written in a cross platform portable > way. Your code is valuable as a measure for precisely this reason, and > there's a trap in trying to program in a way that "the scheduler might like". > That's presumably what Kasper is trying to point out, albeit in a much blunter > fashion. If using a different kernel facility gives better results, go for what works best. Programmers have been doing that since day one. I doubt you'd call it a trap to trade a pipe for a socketpair if one produced better results than the other. Mind you, we should be able to better service the load with plain SCHED_OTHER, no argument there. > The only workloads I'm remotely interested in are real world workloads > involving real applications like yours, software compilation, video playback, > audio playback, gaming, apache page serving, mysql performance and so on that > people in the real world use on real hardware all day every day. These are, of > course, measurable even above and beyond the elusive and impossible to measure > and quantify interactivity and responsiveness. > > I couldn't care less about some artificial benchmark involving LTP, timing > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > sched_yield benchmark, dbench and hackbench which even their original > programmers don't like them being used as a meaningful measure, and so on, and > normal users should also not care about the values returned by these artificial > benchmarks when they bear no resemblance to their real world performance cases > as above. 
I find all programs interesting and valid in their own right, whether they be a benchmark or not, though I agree that vmark and hackbench are a bit over the top. > I have zero interest in adding any "tweaks" to BFS to perform well in X > benchmark, for there be a path where dragons lie. I've always maintained that, > and still stick to it, that the more tweaks you add for corner cases, the more > corner cases you introduce yourself. BFS will remain for a targeted audience > and I care not to appeal to any artificial benchmarketing obsessed population > that drives mainline, since I don't -have- to. Mainline can do what it wants, > and hopefully uses BFS as a yardstick for comparison when appropriate. Interesting rant. IMO, benchmarks are all merely programs that do some work and quantify. Whether you like what they measure or not, whether they emit flattering numbers or not, they can all tell you something if you're willing to listen. Oh, and for the record, the timing-mplayer test was NOT run in the presence of 100,000 pipes; it was run in the presence of one cpu hog, as was the amarok load-time test. Those were UP tests showing you a weakness. All of the results I sent you were intended to show you areas that could use some improvement, but you don't want to hear, so you label and hand-wave. Below is a quote of the results I sent you. <quote> I've taken BFS out for a few spins while looking into BFS vs CFS latency reports, and noticed a couple of problems I'll share; comparison testing has been healthy for CFS, so maybe BFS can profit as well. Below are some bfs304 vs my working tree numbers from a run this morning, looking to see if some issues seen in earlier releases were still present. Comments on noted issues: It looks like there may be some affinity troubles, and there definitely seems to be a fairness bug still lurking. No idea what's up with that, but see data below, it's pretty nasty. Any sleepy load competing with a pure hog seems to be troublesome. 
The pgsql+oltp test data is very interesting to me, pgsql+oltp hates preemption with a passion, because of its USERLAND spinlocks. Preempt the lock holder, and watch the fun. Your preemption model suits it very well at the low end, and does pretty well all the way through. Really interesting to me is the difference in 1 and 2 client throughput, which is why I'm including these. mysql+oltp and tbench look like they're griping about affinity to me, but I haven't instrumented anything, so can't be sure. mysql+oltp I know is wakeup-preemption and affinity sensitive. Too little wakeup preemption, it suffers; any load balancing, it suffers. What vmark is so upset about, I have no idea. I know it's very affinity sensitive, and hates wakeup preemption passionately.

Numbers:

vmark
tip        108841 messages per second
tip++      116260 messages per second
31.bfs304   28279 messages per second

tbench 8
tip        938.421 MB/sec 8 procs
tip++      952.302 MB/sec 8 procs
31.bfs304  709.121 MB/sec 8 procs

mysql+oltp
clients           1         2         4         8        16        32        64       128       256
tip         9999.36  18493.54  34652.91  34253.13  32057.64  30297.43  28300.96  25450.14  20675.99
tip++      10041.16  18531.16  34934.22  34192.65  32829.65  32010.55  30341.31  27340.65  22724.87
31.bfs304   9459.85  14952.44  32209.07  29724.03  28608.02  27051.10  24851.44  21223.15  15809.46

pgsql+oltp
clients           1         2         4         8        16        32        64       128       256
tip        13577.63  26510.67  51871.05  51374.62  50190.69  45494.64  37173.83  27767.09  22795.23
tip++      13685.69  26693.42  52056.45  51733.30  50854.75  49790.95  48972.02  47517.34  44999.22
31.bfs304  15467.03  21126.57  52673.76  50972.41  49652.54  46015.73  44567.18  40419.90  33276.67

fairness bug in 31.bfs304?

prep:
set CPU governor to performance first, as in all benchmarking.
taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy)
taskset -p 0x1 `pidof Xorg`

perf stat taskset -c 0 konsole -e exit
31.bfs304  2.073724549 seconds time elapsed
tip++      0.989323860 seconds time elapsed

note: amarok pins itself to CPU0, and is set up to use mysql database.
prep: cache warmup run.
perf stat amarokapp (quit after 12000 song mp3 collection is loaded)
31.bfs304  136.418518486 seconds time elapsed
tip++       19.439268066 seconds time elapsed

prep: restart amarok, wait for load, start playing
perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie)
31.bfs304  432.712500554 seconds time elapsed
tip++      363.622519583 seconds time elapsed

^ permalink raw reply [flat|nested] 34+ messages in thread
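An aside for anyone reproducing the prep steps above: the `taskset -c 0` / `taskset -p 0x1` pinning can equally be done from inside a process via the same Linux `sched_setaffinity(2)` syscall that taskset uses. A minimal sketch using Python's wrappers (the choice of CPU 0 mirrors the tests above; the `pin_to_cpu` helper name is mine):

```python
import os

def pin_to_cpu(pid: int, cpu: int) -> set:
    """Restrict `pid` (0 = the calling process) to a single CPU,
    like `taskset -c`/`taskset -p`, and return the new affinity mask."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if __name__ == "__main__":
    # Linux-only: pin ourselves to CPU 0, as the konsole/mplayer runs do.
    if hasattr(os, "sched_setaffinity"):
        print("affinity now:", pin_to_cpu(0, 0))
```

This is how one forces the "UP test" condition on an SMP box: both the hog and the measured task end up competing for the same CPU.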
* Re: x264 benchmarks BFS vs CFS 2009-12-19 4:03 ` Mike Galbraith @ 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 0 siblings, 2 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-19 17:36 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote: > On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > > On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote: > > > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote: > > > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote: > > > >> > Having said that, we generally try to make things perform well without > > > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we > > > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as > > > >> > under SCHED_BATCH? > > > >> > > > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the > > > >> biggest problem there. I don't think SCHED_OTHER will ever match > > > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum > > > >> tested. This load really wants RR scheduling, and wakeup preemption > > > >> necessarily perturbs run order. > > > >> > > > >> I'll probably piddle with it some more, it's an interesting load. > > > > > > > > Yes, i must say, very interresting, its very complicated and... oh wait, > > > > its just encoding a movie! > > > > > > Your trolling is becoming a bit over-the-top at this point. You > > > should also considering replying to multiple people in one email as > > > opposed to spamming a whole bunch in sequence. 
> > > > > > Perhaps as the lead x264 developer I'm qualified to say that it > > > certainly is a very complicated load due to the strict ordering > > > requirements of the threading model--and that you should tone down the > > > whining just a tad and perhaps read a bit more about how BFS and CFS > > > work before complaining about them. > > > > Your workload is interesting because it is a well written real world > > application with a solid threading model written in a cross platform portable > > way. Your code is valuable as a measure for precisely this reason, and > > there's a trap in trying to program in a way that "the scheduler might like". > > That's presumably what Kasper is trying to point out, albeit in a much blunter > > fashion. > > If using a different kernel facility gives better results, go for what > works best. Programmers have been doing that since day one. I doubt > you'd call it a trap to trade a pipe for a socketpair if one produced > better results than the other. Of course in this case that is what performs best on a single scheduler... > > Mind you, we should be able to better service the load with plain > SCHED_OTHER, no argument there. Great, so when you said "I don't think it will get better" (or words to that effect), that didn't mean anything? 
> > > > I couldn't care less about some artificial benchmark involving LTP, timing > > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > > sched_yield benchmark, dbench and hackbench which even their original > > programmers don't like them being used as a meaningful measure, and so on, and > > normal users should also not care about the values returned by these artificial > > benchmarks when they bear no resemblance to their real world performance cases > > as above. > > I find all programs interesting and valid in their own right, whether > they be a benchmark or not, though I agree that vmark and hackbench are > a bit over the top. Yes.. it's interesting to SEE; whether it's relevant and something to care about is entirely different. Yes, it's very interesting that something craps out, but this thing is _NEVER_ going to occur in real life, and if it happens to by some magical christmas fluke, then that is fortunately only the ONE time you're seeing that problem, and as such, it's irrelevant, and certainly doesn't merit workarounds which make other very common stuff perform significantly worse. 
I suspect Con is very interested in listening; however, as he has stated, if fixing some corner case in an artificial load requires damaging a real-world load, that is an unacceptable modification to him, and I agree. I ask you this: would you rather have some artificial benchmark run better, but your own everyday applications run slower as a result? It seems to me you would, which I cannot understand. > > Oh, and for the record, timing mplayer thing was NOT in the presence of > 100000 pipes, it was in the presence of one cpu hog, as was the time > amarok loading thing. Those were UP tests showing you a weakness. All > of the results I sent you were intended to show you areas that could use > some improvement, but you don't want to hear, so label and hand-wave. > > Below is a quote of the results I sent you. > > <quote> > > I've taken BFS out for a few spins while looking into BFS vs CFS latency > reports, and noticed a couple problems I'll share, comparison testing > has been healthy for CFS, so maybe BFS can profit as well. Below are > some bfs304 vs my working tree numbers from a run this morning, looking > to see if some issues seen in earlier releases were still present. > > Comments on noted issues: > > It looks like there may be some affinity troubles, and there definitely > seems to be a fairness bug still lurking. No idea what's up with that, > but see data below, it's pretty nasty. Any sleepy load competing with a > pure hog seems to be troublesome. > > The pgsql+oltp test data is very interesting to me, pgsql+oltp hates > preemption with a passion, because of it's USERLAND spinlocks. Preempt > the lock holder, and watch the fun. Your preemption model suits it very > well at the low end, and does pretty well all the way though. Really > interesting to me is the difference in 1 and 2 client throughput, why > I'm including these. > > msql+oltp and tbench look like they're griping about affinity to me, but > I haven't instrumented anything, so can't be sure. 
mysql+oltp I know is > a wakeup preemption and is very affinity sensitive. Too little wakeup > preemption, it suffers, any load balancing, it suffers. > > What vmark is so upset about, I have no idea. I know it's very affinity > sensitive, and hates wakeup preemption passionately. > > Numbers: > > vmark > tip 108841 messages per second > tip++ 116260 messages per second > 31.bfs304 28279 messages per second > > tbench 8 > tip 938.421 MB/sec 8 procs > tip++ 952.302 MB/sec 8 procs > 31.bfs304 709.121 MB/sec 8 procs > > mysql+oltp > clients 1 2 4 8 16 32 64 128 256 > tip 9999.36 18493.54 34652.91 34253.13 32057.64 30297.43 28300.96 25450.14 20675.99 > tip++ 10041.16 18531.16 34934.22 34192.65 32829.65 32010.55 30341.31 27340.65 22724.87 > 31.bfs304 9459.85 14952.44 32209.07 29724.03 28608.02 27051.10 24851.44 21223.15 15809.46 > > pgsql+oltp > clients 1 2 4 8 16 32 64 128 256 > tip 13577.63 26510.67 51871.05 51374.62 50190.69 45494.64 37173.83 27767.09 22795.23 > tip++ 13685.69 26693.42 52056.45 51733.30 50854.75 49790.95 48972.02 47517.34 44999.22 > 31.bfs304 15467.03 21126.57 52673.76 50972.41 49652.54 46015.73 44567.18 40419.90 33276.67 > > fairness bug in 31.bfs304? > > prep: > set CPU governor to performance first, as in all benchmarking. > taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy) > taskset -p 0x1 `pidof Xorg` > > perf stat taskset -c 0 konsole -e exit > 31.bfs304 2.073724549 seconds time elapsed > tip++ 0.989323860 seconds time elapsed > > note: amarok pins itself to CPU0, and is set up to use mysql database. > > prep: cache warmup run. 
> perf stat amarokapp (quit after 12000 song mp3 collection is loaded) > > 31.bfs304 136.418518486 seconds time elapsed > tip++ 19.439268066 seconds time elapsed > > prep: restart amarok, wait for load, start playing > > perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie) > 31.bfs304 432.712500554 seconds time elapsed > tip++ 363.622519583 seconds time elapsed > But presumably the cpu hog is running at the same priority, and if this is done on a UP system, that will obviously mean fairness will make stuff slower.. Try this on a dualcore or quadcore system, or of course just set the niceness accordingly... > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 17:36 ` Kasper Sandberg @ 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2009-12-19 20:57 UTC (permalink / raw) To: Kasper Sandberg Cc: Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sat, 2009-12-19 at 18:36 +0100, Kasper Sandberg wrote: > On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote: > > On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote: > > > Your workload is interesting because it is a well written real world > > > application with a solid threading model written in a cross platform portable > > > way. Your code is valuable as a measure for precisely this reason, and > > > there's a trap in trying to program in a way that "the scheduler might like". > > > That's presumably what Kasper is trying to point out, albeit in a much blunter > > > fashion. > > > > If using a different kernel facility gives better results, go for what > > works best. Programmers have been doing that since day one. I doubt > > you'd call it a trap to trade a pipe for a socketpair if one produced > > better results than the other. > > Ofcourse in this case that is what performs best one a single > scheduler... I have no idea what you're talking about here. > > Mind you, we should be able to better service the load with plain > > SCHED_OTHER, no argument there. > Great, so when you said "i dont think it will get better"(or words to > that effect), that didnt mean anything? Or here. Look. BFS handles this load well, a little better than CFS in fact. I don't have a problem with that, but you seem to think it's a big hairy deal for some strange reason. 
> > > The only workloads I'm remotely interested in are real world workloads > > > involving real applications like yours, software compilation, video playback, > > > audio playback, gaming, apache page serving, mysql performance and so on that > > > people in the real world use on real hardware all day every day. These are, of > > > course, measurable even above and beyond the elusive and impossible to measure > > > and quantify interactivity and responsiveness. > > > > > > I couldn't care less about some artificial benchmark involving LTP, timing > > > mplayer playing in the presence of 100,000 pipes, volanomark which is just a > > > sched_yield benchmark, dbench and hackbench which even their original > > > programmers don't like them being used as a meaningful measure, and so on, and > > > normal users should also not care about the values returned by these artificial > > > benchmarks when they bear no resemblance to their real world performance cases > > > as above. > > > > I find all programs interesting and valid in their own right, whether > > they be a benchmark or not, though I agree that vmark and hackbench are > > a bit over the top. > > Yes.. its interresting to SEE, whether its relevant and something to > care about is entirely different. > > Yes, its very interresting that something craps out, now, this thing is > _NEVER_ going to occur in real life, and if it happens to do by some > magical christmas fluke, then that is fortunately only ONE time you're > seeing that problem, and as such, its irellevant, and certainly doesnt > merit workarounds which makes other very common stuff perform > significantly worse. Haven't you noticed yet that nobody but you and Con has suggested any course of action whatsoever? That it is you two who both mention then condemn workarounds and load specific tweaks all in the same breath with not one word having come from any other source? 
> > > I have zero interest in adding any "tweaks" to BFS to perform well in X > > > benchmark, for there be a path where dragons lie. I've always maintained that, > > > and still stick to it, that the more tweaks you add for corner cases, the more > > > corner cases you introduce yourself. BFS will remain for a targeted audience > > > and I care not to appeal to any artificial benchmarketing obsessed population > > > that drives mainline, since I don't -have- to. Mainline can do what it wants, > > > and hopefully uses BFS as a yardstick for comparison when appropriate. > > > > Interesting rant. IMO, benchmarks are all merely programs that do some > > work and quantify. Whether you like what they measure or not, whether > > they emit flattering numbers or not, they can all tell you something if > > you're willing to listen. > > I suspect con is very interrested in listening, however, as he have > stated, if fixing some corner case in an artificial load requires > damaging a realworld load, that is an unacceptable modification to him, > and I agree. I ask you this, would you rather some artificial benchmark > ran better, but your own everyday applications ran slower as a result? > It seems to me you do, which i can not understand. You can hand-wave all you want, I really do not care, but kindly keep your words out of my mouth. > > fairness bug in 31.bfs304? > > > > prep: > > set CPU governor to performance first, as in all benchmarking. > > taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy) > > taskset -p 0x1 `pidof Xorg` > > > > perf stat taskset -c 0 konsole -e exit > > 31.bfs304 2.073724549 seconds time elapsed > > tip++ 0.989323860 seconds time elapsed > > > > note: amarok pins itself to CPU0, and is set up to use mysql database. > > > > prep: cache warmup run. 
> > perf stat amarokapp (quit after 12000 song mp3 collection is loaded) > > > > 31.bfs304 136.418518486 seconds time elapsed > > tip++ 19.439268066 seconds time elapsed > > > > prep: restart amarok, wait for load, start playing > > > > perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie) > > 31.bfs304 432.712500554 seconds time elapsed > > tip++ 363.622519583 seconds time elapsed > > > > But presumably the cpu hog is running at same priority, and if this is > done on a UP system, that will obviously mean fairness will make stuff > slower.. > > Try this on a dualcore or quadcore system, or ofcourse just set the > niceness accordingly... Amazing that you can actually say that with a straight face. Look. You can hand-wave all results into irrelevance, I do not care. You've both made it perfectly clear that test results are not welcome. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith @ 2009-12-20 3:22 ` Andres Freund 2009-12-20 12:10 ` Kasper Sandberg 1 sibling, 1 reply; 34+ messages in thread From: Andres Freund @ 2009-12-20 3:22 UTC (permalink / raw) To: Kasper Sandberg Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > Try this on a dualcore or quadcore system, or of course just set the > niceness accordingly... Oh well. This is getting too much for a normally very silent and flame-fearing reader. Didn't *you* just tell others to shut up about using any tunables for any application? And that you don't need any tunables for BFS? Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 3:22 ` Andres Freund @ 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 0 siblings, 2 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-20 12:10 UTC (permalink / raw) To: Andres Freund Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote: > On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > > Try this on a dualcore or quadcore system, or of course just set the > > niceness accordingly... > Oh well. This is getting too much for a normally very silent and flame-fearing > reader. Didn't *you* just tell others to shut up about using any tunables for > any application? And that you don't need any tunables for BFS? That was an entirely different case; have you even been following the thread? OF COURSE you're going to see slowdowns on a UP system if you have a cpu hog and then run something else; this is the only behavior possible, and BFS handles it in a fair way. When I said we needed no tunables, that was for running a _SINGLE_ application, and then measuring said application's performance (where BFS indeed does beat CFS by quite a large margin). And as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called the "completely FAIR scheduler"? Or is that just the marketing name? > > Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 12:10 ` Kasper Sandberg @ 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-20 13:09 UTC (permalink / raw) To: Andres Freund Cc: Mike Galbraith, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote: > > On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote: > > > Try this on a dualcore or quadcore system, or of course just set the > > > niceness accordingly... > > Oh well. This is getting too much for a normally very silent and flame-fearing > > reader. Didn't *you* just tell others to shut up about using any tunables for > > any application? And that you don't need any tunables for BFS? Oh, and btw, the niceness is not really a "tunable". > > That was an entirely different case; have you even been following the > thread? > > OF COURSE you're going to see slowdowns on a UP system if you have a cpu > hog and then run something else; this is the only behavior possible, and > BFS handles it in a fair way. > > When I said we needed no tunables, that was for running a _SINGLE_ > application, and then measuring said application's performance (where > BFS indeed does beat CFS by quite a large margin). > > And as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called > the "completely FAIR scheduler"? Or is that just the marketing name? > > > > > > > Andres ^ permalink raw reply [flat|nested] 34+ messages in thread
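For reference on the "niceness is not really a tunable" point: the nice level under discussion is a single per-process call, the same knob `nice -n 5 <cmd>` sets from the shell. A minimal sketch via Python's wrappers for `nice(2)`/`getpriority(2)` (the `lower_priority` helper name and the delta of 5 are illustrative):

```python
import os

def lower_priority(delta: int = 5) -> int:
    """De-prioritize the calling process, like launching it with
    `nice -n 5`.  An unprivileged process may only raise its nice
    value (run "nicer"), not lower it back.  Returns the resulting
    nice value (-20 highest priority .. 19 lowest)."""
    os.nice(delta)
    return os.getpriority(os.PRIO_PROCESS, 0)
```

Giving the cpu hog a higher nice value this way (or via `nice`/`renice`) is what "set the niceness accordingly" amounts to in the UP tests being argued about.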
* Re: x264 benchmarks BFS vs CFS 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg @ 2009-12-20 15:13 ` Mike Galbraith 2009-12-20 15:51 ` Mike Galbraith 1 sibling, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-20 15:13 UTC (permalink / raw) To: Kasper Sandberg Cc: Andres Freund, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > and as for CFS, it SHOULD exhibit fair behavior anyway, isnt it called > "completely FAIR scheduler" ? or is that just the marketing name? Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment. -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 15:13 ` Mike Galbraith @ 2009-12-20 15:51 ` Mike Galbraith 2009-12-22 7:33 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2009-12-20 15:51 UTC (permalink / raw) To: Kasper Sandberg Cc: Andres Freund, Con Kolivas, Jason Garrett-Glaser, Ingo Molnar, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Sun, 2009-12-20 at 16:13 +0100, Mike Galbraith wrote: > On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote: > > > and as for CFS, it SHOULD exhibit fair behavior anyway, isnt it called > > "completely FAIR scheduler" ? or is that just the marketing name? > > Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment. All done? Do you think THAT may be why I thought Con might be interested?!? -Mike ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-20 15:51 ` Mike Galbraith @ 2009-12-22 7:33 ` Jason Garrett-Glaser 2009-12-22 7:39 ` Jason Garrett-Glaser 0 siblings, 1 reply; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-22 7:33 UTC (permalink / raw) To: Mike Galbraith Cc: Kasper Sandberg, Andres Freund, Con Kolivas, Ingo Molnar, Peter Zijlstra, LKML Mailinglist Benchmarks for the new threading model are up, along with a few others: http://doom10.org/index.php?topic=78.0 Interestingly enough, CFS beats BFS on zerolatency by a significant margin. Unsurprisingly, given the threading model, the optimal number of threads is equal to the number of cores. Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-22 7:33 ` Jason Garrett-Glaser @ 2009-12-22 7:39 ` Jason Garrett-Glaser 0 siblings, 0 replies; 34+ messages in thread From: Jason Garrett-Glaser @ 2009-12-22 7:39 UTC (permalink / raw) To: Mike Galbraith Cc: Kasper Sandberg, Andres Freund, Con Kolivas, Ingo Molnar, Peter Zijlstra, LKML Mailinglist On Tue, Dec 22, 2009 at 2:33 AM, Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > Benchmarks for the new threading model are up, along with a few others: > > http://doom10.org/index.php?topic=78.0 > > Interestingly enough, CFS beats BFS on zerolatency by a significant > margin. Unsurprisingly, given the threading model, the optimal number > of threads is equal to the number of cores. > > Jason > And I am apparently blind: I cannot read graphs. Ignore the conclusion made in the above post ;) Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x264 benchmarks BFS vs CFS 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar @ 2009-12-18 10:56 ` Kasper Sandberg 1 sibling, 0 replies; 34+ messages in thread From: Kasper Sandberg @ 2009-12-18 10:56 UTC (permalink / raw) To: Jason Garrett-Glaser Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, LKML Mailinglist, Linus Torvalds On Thu, 2009-12-17 at 17:18 -0800, Jason Garrett-Glaser wrote: > On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote: > >> * Jason Garrett-Glaser <darkshikari@gmail.com> wrote: > >> > >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <lkml@metanurb.dk> wrote: > >> > > well well :) nothing quite speaks out like graphs.. > >> > > > >> > > http://doom10.org/index.php?topic=78.0 > >> > > > >> > > > >> > > > >> > > regards, > >> > > Kasper Sandberg > >> > > >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied > >> > it--and given the strict thread-ordering expectations of x264, you basically > >> > can't expect it to do any better, though I'm curious what's responsible for > >> > the gap in "veryslow", even with SCHED_BATCH enabled. > >> > > >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS > >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly > > > > Thats kinda besides the point. > > > > all these tunables and weirdness is _NEVER_ going to work for people. > > Can't individually applications request SCHED_BATCH? Our plan was to > have x264 simply detect if it was necessary (once we figure out what > encoding settings result in the large gap situation) and automatically > enable it for the current application. that is an insane solution, especially considering better schedulers outperform cfs SCHED_BATCH without doing ANYTHING special. Do you not see what is happening here? 
it is simply grotesque. > > Jason ^ permalink raw reply [flat|nested] 34+ messages in thread
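Whatever one thinks of the plan, the self-selection Jason describes upthread — x264 switching itself to SCHED_BATCH — is a small amount of code on Linux. A sketch via Python's binding of `sched_setscheduler(2)` (x264 itself would make the equivalent C call; the `enable_sched_batch` helper name is mine):

```python
import os

def enable_sched_batch():
    """Switch the calling process to SCHED_BATCH, as an application
    could do for itself at startup.  Moving between the default
    SCHED_OTHER and SCHED_BATCH needs no privileges; both use static
    priority 0.  Returns the active policy, or None if the platform
    or kernel refused the change."""
    try:
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    except (AttributeError, OSError):
        return None
    return os.sched_getscheduler(0)

if __name__ == "__main__":
    print("policy:", enable_sched_batch())
```

SCHED_BATCH only hints to the scheduler that the process is CPU-bound and non-interactive, which is why it affects wakeup preemption without changing priorities.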
end of thread, other threads:[~2009-12-22 7:39 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-12-17 9:33 x264 benchmarks BFS vs CFS Kasper Sandberg 2009-12-17 10:42 ` Jason Garrett-Glaser 2009-12-17 10:53 ` Ingo Molnar 2009-12-17 11:00 ` Kasper Sandberg 2009-12-17 12:08 ` Ingo Molnar 2009-12-17 12:35 ` Kasper Sandberg 2009-12-17 15:47 ` Arjan van de Ven 2009-12-17 13:30 ` Mike Galbraith 2009-12-18 10:54 ` Kasper Sandberg 2009-12-18 11:41 ` Mike Galbraith 2009-12-17 21:22 ` Thomas Fjellstrom 2009-12-18 10:56 ` Kasper Sandberg 2009-12-18 1:18 ` Jason Garrett-Glaser 2009-12-18 5:23 ` Ingo Molnar 2009-12-18 7:30 ` Mike Galbraith 2009-12-18 10:11 ` Jason Garrett-Glaser 2009-12-18 12:49 ` Mike Galbraith 2009-12-18 13:06 ` Ingo Molnar 2009-12-18 13:36 ` Mike Galbraith 2009-12-18 13:53 ` Mike Galbraith 2009-12-18 10:57 ` Kasper Sandberg 2009-12-18 11:05 ` Jason Garrett-Glaser 2009-12-19 1:08 ` Con Kolivas 2009-12-19 4:03 ` Mike Galbraith 2009-12-19 17:36 ` Kasper Sandberg 2009-12-19 20:57 ` Mike Galbraith 2009-12-20 3:22 ` Andres Freund 2009-12-20 12:10 ` Kasper Sandberg 2009-12-20 13:09 ` Kasper Sandberg 2009-12-20 15:13 ` Mike Galbraith 2009-12-20 15:51 ` Mike Galbraith 2009-12-22 7:33 ` Jason Garrett-Glaser 2009-12-22 7:39 ` Jason Garrett-Glaser 2009-12-18 10:56 ` Kasper Sandberg