* Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
@ 2008-01-14 17:37 Colin Fowler
2008-01-14 18:55 ` Ingo Molnar
0 siblings, 1 reply; 13+ messages in thread
From: Colin Fowler @ 2008-01-14 17:37 UTC (permalink / raw)
To: linux-kernel
Please CC me as I'm not subscribed.
I have (what is to me) a strange and very repeatable slowdown for a
CPU intensive benchmark on my system on newer kernels.
Hardware : Dell Precision 470.
CPU 2x2.0GHz Quad Core Xeon E5335 CPUs
Memory 4GB ECC RAM.
OS Ubuntu x86_64 7.10 (Gutsy Gibbon)
Compiler : gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)
Benchmark : A ray-trace is performed 500 times on each of 17 separate
scenes. Workload is distributed by tiling the framebuffer into N 32x32
pixel tiles. Each CPU grabs one of N tiles out of the queue and
repeats until no jobs are left. Rendering is to a shared framebuffer
(obviously this causes problems with caching). Locking and
synchronization is done using pthreads.
Other details: The system is cleanly booted for each run. No I/O is
performed during the timed portions of the test. The benchmark does
however read a model file from the drive and build a data structure
from it before each timed portion.
On the 2.6.22 series of kernels results are pretty much the same. On
2.6.23 series kernels I see a loss in speed of ~2% across the board.
On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
2.6.23 Kernels tested: 23.1, 23.3, 23.13
2.6.24 Kernels tested: 24-rc7
I have my kernel compiled to use the SLAB allocator. All other
tweaking options are set as defaults. My config files are available at
http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
something wrong for the type of work I do?
regards,
Colin
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-14 17:37 Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon Colin Fowler
@ 2008-01-14 18:55 ` Ingo Molnar
2008-01-14 22:42 ` Colin Fowler
0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-01-14 18:55 UTC (permalink / raw)
To: Colin Fowler; +Cc: linux-kernel, Peter Zijlstra
* Colin Fowler <elethiomel@gmail.com> wrote:
> Benchmark : A ray-trace is performed 500 times on each of 17 separate
> scenes. Workload is distributed by tiling the framebuffer into N 32x32
> pixel tiles. Each CPU grabs one of N tiles out of the queue and
> repeats until no jobs are left. Rendering is to a shared framebuffer
> (obviously this causes problems with caching). Locking and
> synchronization is done using pthreads.
>
> Other details: The system is cleanly booted for each run. No I/O is
> performed during the timed portions of the test. The benchmark does
> however read a model file from the drive and build a data structure
> from it before each timed portion.
>
> On the 2.6.22 series of kernels results are pretty much the same. On
> 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> 2.6.23 Kernels tested: 23.1, 23.3, 23.13
> 2.6.24 Kernels tested: 24-rc7
>
> I have my kernel compiled to use the SLAB allocator. All other
> tweaking options are set as defaults. My config files are available at
> http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> something wrong for the type of work I do?
Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
the value of /proc/sys/kernel/sched_latency_ns - does that make any
difference? Please also run the following script while the ray-trace app
is running:
http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
and send me the output of it, so that we can have an idea about what's
going on in your system during this workload.
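Doubling the latency target as suggested can be done like this (a sketch: requires root, and the sysctl is only present when the kernel is built with CONFIG_SCHED_DEBUG=y):

```shell
# Read the current CFS latency target and write back twice the value.
cur=$(cat /proc/sys/kernel/sched_latency_ns)
echo $((cur * 2)) > /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_latency_ns
```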
Ingo
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-14 18:55 ` Ingo Molnar
@ 2008-01-14 22:42 ` Colin Fowler
2008-01-14 22:43 ` Colin Fowler
2008-01-15 17:01 ` Colin Fowler
0 siblings, 2 replies; 13+ messages in thread
From: Colin Fowler @ 2008-01-14 22:42 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Hi Ingo, thanks for the reply.
Doubling /proc/sys/kernel/sched_latency_ns may in fact have made
things slightly worse. I used 2.6.24-rc7.
Your script only runs for 15 seconds, so I ran it multiple times so
that it covered most of the benchmark.
One caveat with these data: for much of the benchmark I am building
data structures using at most 1 to 3 cores. I'm not concerned with
these timings personally, as this is considered the offline part of
the render. Once the data structures are built I render across all 8
cores. This is the section of the benchmark I take my timings from (I
use RDTSC before and after the render segment). The majority of the
overall time taken for a run is therefore data structure building,
which I do not time.
Colin.
On Jan 14, 2008 6:55 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Colin Fowler <elethiomel@gmail.com> wrote:
>
> > Benchmark : A ray-trace is performed 500 times on each of 17 separate
> > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > repeats until no jobs are left. Rendering is to a shared framebuffer
> > (obviously this causes problems with caching). Locking and
> > synchronization is done using pthreads.
> >
> > Other details: The system is cleanly booted for each run. No I/O is
> > performed during the timed portions of the test. The benchmark does
> > however read a model file from the drive and build a data structure
> > from it before each timed portion.
> >
> > On the 2.6.22 series of kernels results are pretty much the same. On
> > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > 2.6.23 Kernels tested: 23.1, 23.3, 23.13
> > 2.6.24 Kernels tested: 24-rc7
> >
> > I have my kernel compiled to use the SLAB allocator. All other
> > tweaking options are set as defaults. My config files are available at
> > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > something wrong for the type of work I do?
>
> Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> the value of /proc/sys/kernel/sched_latency_ns - does that make any
> difference? Please also run the following script while the ray-trace app
> is running:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> and send me the output of it, so that we can have an idea about what's
> going on in your system during this workload.
>
> Ingo
>
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-14 22:42 ` Colin Fowler
@ 2008-01-14 22:43 ` Colin Fowler
2008-01-15 17:01 ` Colin Fowler
1 sibling, 0 replies; 13+ messages in thread
From: Colin Fowler @ 2008-01-14 22:43 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Forgot to add that the results are at http://vangogh.cs.tcd.ie/fowler/cfs/
On Jan 14, 2008 10:42 PM, Colin Fowler <elethiomel@gmail.com> wrote:
> Hi Ingo, thanks for the reply.
>
> Modifying /proc/sys/kernel/sched_latency_ns to be double may have in
> fact made things slightly worse. I used 24-rc7
>
> Your script only runs for 15 seconds, so I ran it multiple times so
> that it covered most of the benchmark.
> Other issues with these data may be that for much of the benchmark I
> am building data structures utilizing at most 1 to 3 cores. I'm not
> concerned with these timings personally as this is considered the
> offline part of the render. Once these data structures are built I
> proceed to render across 8 cores. This is the section of the benchmark
> I get my timings from ( I use RDTSC before and after the render
> segment). The majority of the overall time taken for a run is
> therefore data structure building. I do not time this.
>
> Colin.
>
>
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Colin Fowler <elethiomel@gmail.com> wrote:
> >
> > > Benchmark : A ray-trace is performed 500 times on each of 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > > repeats until no jobs are left. Rendering is to a shared framebuffer
> > > (obviously this causes problems with caching). Locking and
> > > synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O is
> > > performed during the timed portions of the test. The benchmark does
> > > however read a model file from the drive and build a data structure
> > > from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same. On
> > > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > > 2.6.23 Kernels tested: 23.1, 23.3, 23.13
> > > 2.6.24 Kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are available at
> > > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > > something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> > the value of /proc/sys/kernel/sched_latency_ns - does that make any
> > difference? Please also run the following script while the ray-trace app
> > is running:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about what's
> > going on in your system during this workload.
> >
> > Ingo
> >
>
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-14 22:42 ` Colin Fowler
2008-01-14 22:43 ` Colin Fowler
@ 2008-01-15 17:01 ` Colin Fowler
2008-01-15 22:06 ` Ingo Molnar
1 sibling, 1 reply; 13+ messages in thread
From: Colin Fowler @ 2008-01-15 17:01 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
These data may be more useful to you. It's a single 15-second data
collection run taken only while the actual ray-tracing is happening;
these data therefore do not cover the data structure building phase.
http://vangogh.cs.tcd.ie/fowler/cfs2/
Colin
On Jan 14, 2008 10:42 PM, Colin Fowler <elethiomel@gmail.com> wrote:
> Hi Ingo, thanks for the reply.
>
> Modifying /proc/sys/kernel/sched_latency_ns to be double may have in
> fact made things slightly worse. I used 24-rc7
>
> Your script only runs for 15 seconds, so I ran it multiple times so
> that it covered most of the benchmark.
> Other issues with these data may be that for much of the benchmark I
> am building data structures utilizing at most 1 to 3 cores. I'm not
> concerned with these timings personally as this is considered the
> offline part of the render. Once these data structures are built I
> proceed to render across 8 cores. This is the section of the benchmark
> I get my timings from ( I use RDTSC before and after the render
> segment). The majority of the overall time taken for a run is
> therefore data structure building. I do not time this.
>
> Colin.
>
>
>
> On Jan 14, 2008 6:55 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Colin Fowler <elethiomel@gmail.com> wrote:
> >
> > > Benchmark : A ray-trace is performed 500 times on each of 17 separate
> > > scenes. Workload is distributed by tiling the framebuffer into N 32x32
> > > pixel tiles. Each CPU grabs one of N tiles out of the queue and
> > > repeats until no jobs are left. Rendering is to a shared framebuffer
> > > (obviously this causes problems with caching). Locking and
> > > synchronization is done using pthreads.
> > >
> > > Other details: The system is cleanly booted for each run. No I/O is
> > > performed during the timed portions of the test. The benchmark does
> > > however read a model file from the drive and build a data structure
> > > from it before each timed portion.
> > >
> > > On the 2.6.22 series of kernels results are pretty much the same. On
> > > 2.6.23 series kernels I see a loss in speed of ~2% across the board.
> > > On 2.6.24-rc7 that loss in speed is perhaps very slightly worse (~3%).
> > > 2.6.22 Kernels tested: 22.9(Ubuntu Stock Kernel), 22.14, 22.15
> > > 2.6.23 Kernels tested: 23.1, 23.3, 23.13
> > > 2.6.24 Kernels tested: 24-rc7
> > >
> > > I have my kernel compiled to use the SLAB allocator. All other
> > > tweaking options are set as defaults. My config files are available at
> > > http://vangogh.cs.tcd.ie/fowler/configs . Perhaps I'm configuring
> > > something wrong for the type of work I do?
> >
> > Could you try CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y and double
> > the value of /proc/sys/kernel/sched_latency_ns - does that make any
> > difference? Please also run the following script while the ray-trace app
> > is running:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > and send me the output of it, so that we can have an idea about what's
> > going on in your system during this workload.
> >
> > Ingo
> >
>
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-15 17:01 ` Colin Fowler
@ 2008-01-15 22:06 ` Ingo Molnar
2008-01-15 23:05 ` Colin Fowler
2008-01-16 17:34 ` Colin Fowler
0 siblings, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2008-01-15 22:06 UTC (permalink / raw)
To: Colin Fowler; +Cc: linux-kernel, Peter Zijlstra
* Colin Fowler <elethiomel@gmail.com> wrote:
> These data may be much better for you. It's a single 15 second data
> collection run only when the actual ray-tracing is happening. These
> data do not therefore cover the data structure building phase.
>
> http://vangogh.cs.tcd.ie/fowler/cfs2/
hm, the system has considerable idle time left:
r b swpd free buff cache si so bi bo in cs us sy id wa
8 0 0 1201920 683840 1039100 0 0 3 2 27 46 1 0 99 0
2 0 0 1202168 683840 1039112 0 0 0 0 245 45339 80 2 17 0
2 0 0 1202168 683840 1039112 0 0 0 0 263 47349 84 3 14 0
2 0 0 1202300 683848 1039112 0 0 0 76 255 47057 84 3 13 0
and context-switches 45K times a second. Do you know what is going on
there? I thought ray-tracing is something that can be parallelized
pretty efficiently, without having to contend and schedule too much.
could you try to do a similar capture on 2.6.22 as well (under the same
phase of the same workload), as comparison?
there are a handful of 'scheduler feature bits' in
/proc/sys/kernel/sched_features:
enum {
SCHED_FEAT_NEW_FAIR_SLEEPERS = 1,
SCHED_FEAT_WAKEUP_PREEMPT = 2,
SCHED_FEAT_START_DEBIT = 4,
SCHED_FEAT_TREE_AVG = 8,
SCHED_FEAT_APPROX_AVG = 16,
};
const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_NEW_FAIR_SLEEPERS * 1 |
SCHED_FEAT_WAKEUP_PREEMPT * 1 |
SCHED_FEAT_START_DEBIT * 1 |
SCHED_FEAT_TREE_AVG * 0 |
SCHED_FEAT_APPROX_AVG * 0;
[as of 2.6.24-rc7]
could you try to turn some of them off/on. In particular toggling
WAKEUP_PREEMPT might have an effect, and NEW_FAIR_SLEEPERS might have an
effect as well. (TREE_AVG and APPROX_AVG has probably little effect)
other debug-tunables you might want to look into are in the
/proc/sys/kernel/sched_domains hierarchy.
also, if you toggle:
/sys/devices/system/cpu/sched_mc_power_savings
does that change the results?
Ingo
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-15 22:06 ` Ingo Molnar
@ 2008-01-15 23:05 ` Colin Fowler
2008-01-16 15:35 ` Ingo Molnar
2008-01-16 17:34 ` Colin Fowler
1 sibling, 1 reply; 13+ messages in thread
From: Colin Fowler @ 2008-01-15 23:05 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Hi Ingo,
I'll get the results tomorrow as I'm now out of the office, but
I can perhaps answer some of your queries now.
On Jan 15, 2008 10:06 PM, Ingo Molnar <mingo@elte.hu> wrote:
> hm, the system has considerable idle time left:
>
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 8 0 0 1201920 683840 1039100 0 0 3 2 27 46 1 0 99 0
> 2 0 0 1202168 683840 1039112 0 0 0 0 245 45339 80 2 17 0
> 2 0 0 1202168 683840 1039112 0 0 0 0 263 47349 84 3 14 0
> 2 0 0 1202300 683848 1039112 0 0 0 76 255 47057 84 3 13 0
>
> and context-switches 45K times a second. Do you know what is going on
> there? I thought ray-tracing is something that can be parallelized
> pretty efficiently, without having to contend and schedule too much.
>
This is an RTRT (real-time ray tracing) system, and as a result it
differs from traditional offline ray-tracers in that it is optimised
for speed. The benchmark I ran while these data were collected renders
an 80K-polygon scene to a 512x512 buffer at just over 100 fps.
The context switches are most likely caused by the pthreads
synchronisation code. There are two mutexes. Each job is a 32x32 tile,
so each mutex is unlocked (512/32) * (512/32) * 100 (for 100 fps) * 2
~= 50K times per second. That's very likely where our context switches
are coming from. Larger tile sizes would of course reduce the locking
overhead, but then the ray-tracer suffers from load imbalance, as some
tiles are much quicker to render than others. Empirically we've found
that this tile size works best for us.
The CPU idling occurs as the system doesn't yet perform asynchronous
rendering. When all tiles in a current job queue are finished the
current frame is done. At this point all worker threads sleep while
the master thread blits the image to the screen and fills the job
queue for the next frame. The data probably shows that one CPU is kept
maxed and the others reach about 90% most of the time. This is
something on my TODO list to fix along with a myriad of other
optimisations :)
regards,
Colin
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-15 23:05 ` Colin Fowler
@ 2008-01-16 15:35 ` Ingo Molnar
2008-01-16 16:10 ` Colin Fowler
0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-01-16 15:35 UTC (permalink / raw)
To: Colin Fowler; +Cc: linux-kernel, Peter Zijlstra
* Colin Fowler <elethiomel@gmail.com> wrote:
> > and context-switches 45K times a second. Do you know what is going
> > on there? I thought ray-tracing is something that can be
> > parallelized pretty efficiently, without having to contend and
> > schedule too much.
>
> This is a RTRT (real-time ray tracing) system and as a result differs
> from traditional offline ray-tracers as it is optimised for speed. The
> benchmark I ran while these data were collected renders an 80K polygon
> scene to a 512x512 buffer at just over 100fps.
>
> The context switches are most likely caused by the pthreads
> synchronisation code. There are two mutexs. Each job is a 32x32 tile
> and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
> 100fps) * 2 =~50k. There's very likely where our context switches are
> coming from. Larger tile sizes would of course reduce the locking
> overhead, but then the ray-tracer suffers from load imbalance as some
> tiles are much quicker to render than others. Empirically we've found
> that this tile-size works the best for us.
>
> The CPU idling occurs as the system doesn't yet perform asynchronous
> rendering. When all tiles in a current job queue are finished the
> current frame is done. At this point all worker threads sleep while
> the master thread blits the image to the screen and fills the job
> queue for the next frame. The data probably shows that one CPU is kept
> maxed and the others reach about 90% most of the time. This is
> something on my TODO list to fix along with a myriad of other
> optimisations :)
is this something i could run myself and see how it behaves with various
scheduler settings? (if yes, where can i download it and is there any
sample scene that would show similar effects.)
Ingo
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-16 15:35 ` Ingo Molnar
@ 2008-01-16 16:10 ` Colin Fowler
2008-01-16 16:19 ` Ingo Molnar
0 siblings, 1 reply; 13+ messages in thread
From: Colin Fowler @ 2008-01-16 16:10 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Hi Ingo, I'll first need to convince my supervisor that I can release
a binary. Technically anything like this needs to go through our
University's "innovations department" and requires lengthy paperwork
and NDAs :(.
regards,
Colin
On Jan 16, 2008 3:35 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Colin Fowler <elethiomel@gmail.com> wrote:
>
>
> > > and context-switches 45K times a second. Do you know what is going
> > > on there? I thought ray-tracing is something that can be
> > > parallelized pretty efficiently, without having to contend and
> > > schedule too much.
> >
> > This is a RTRT (real-time ray tracing) system and as a result differs
> > from traditional offline ray-tracers as it is optimised for speed. The
> > benchmark I ran while these data were collected renders an 80K polygon
> > scene to a 512x512 buffer at just over 100fps.
> >
> > The context switches are most likely caused by the pthreads
> > synchronisation code. There are two mutexs. Each job is a 32x32 tile
> > and each mutex is therefore unlocked (512/32) * (512/32) * 100 (for
> > 100fps) * 2 =~50k. There's very likely where our context switches are
> > coming from. Larger tile sizes would of course reduce the locking
> > overhead, but then the ray-tracer suffers from load imbalance as some
> > tiles are much quicker to render than others. Empirically we've found
> > that this tile-size works the best for us.
> >
> > The CPU idling occurs as the system doesn't yet perform asynchronous
> > rendering. When all tiles in a current job queue are finished the
> > current frame is done. At this point all worker threads sleep while
> > the master thread blits the image to the screen and fills the job
> > queue for the next frame. The data probably shows that one CPU is kept
> > maxed and the others reach about 90% most of the time. This is
> > something on my TODO list to fix along with a myriad of other
> > optimisations :)
>
> is this something i could run myself and see how it behaves with various
> scheduler settings? (if yes, where can i download it and is there any
> sample scene that would show similar effects.)
>
> Ingo
>
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-16 16:10 ` Colin Fowler
@ 2008-01-16 16:19 ` Ingo Molnar
2008-01-16 16:38 ` Colin Fowler
0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-01-16 16:19 UTC (permalink / raw)
To: Colin Fowler; +Cc: linux-kernel, Peter Zijlstra
* Colin Fowler <elethiomel@gmail.com> wrote:
> Hi Ingo, I'll need to convince my supervisor first if I can release a
> binary. Technically anything like this needs to go through our
> University's "innovations department" and requires lengthy paperwork
> and NDAs :(.
a binary wouldn't work for me anyway. But you could try to write a
"workload simulator": just pick out the pthread ops and replace the
worker functions with some dummy stuff that just touches an array that
has similar size to the tiles (in a tight loop). Make sure it has
similar context-switch rate and idle percentage as your real workload -
then send us the .c file. As long as it's a single .c file that runs for
a few seconds and outputs a precise enough "run time" result, kernel
developers would pick it up and use it for optimizations. To get the #
of cpus automatically you can do:
cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
cpus = WEXITSTATUS(cpus);
and start as many threads as many CPUs there are in the system.
Ingo
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-16 16:19 ` Ingo Molnar
@ 2008-01-16 16:38 ` Colin Fowler
0 siblings, 0 replies; 13+ messages in thread
From: Colin Fowler @ 2008-01-16 16:38 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Hi Ingo,
I have permission for a binary-only release (I mailed my supervisor
immediately after your earlier mail). I'm sure the abstract code
simulating the workload will be alright too, but I need time to put it
together as I'm a bit swamped at the moment. I hope to have it ready
in the next few days.
regards,
Colin
On Jan 16, 2008 4:19 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Colin Fowler <elethiomel@gmail.com> wrote:
>
> > Hi Ingo, I'll need to convince my supervisor first if I can release a
> > binary. Technically anything like this needs to go through our
> > University's "innovations department" and requires lengthy paperwork
> > and NDAs :(.
>
> a binary wouldn't work for me anyway. But you could try to write a
> "workload simulator": just pick out the pthread ops and replace the
> worker functions with some dummy stuff that just touches an array that
> has similar size to the tiles (in a tight loop). Make sure it has
> similar context-switch rate and idle percentage as your real workload -
> then send us the .c file. As long as it's a single .c file that runs for
> a few seconds and outputs a precise enough "run time" result, kernel
> developers would pick it up and use it for optimizations. To get the #
> of cpus automatically you can do:
>
> cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
> cpus = WEXITSTATUS(cpus);
>
> and start as many threads as many CPUs there are in the system.
>
> Ingo
>
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-15 22:06 ` Ingo Molnar
2008-01-15 23:05 ` Colin Fowler
@ 2008-01-16 17:34 ` Colin Fowler
2008-01-18 10:55 ` Ingo Molnar
1 sibling, 1 reply; 13+ messages in thread
From: Colin Fowler @ 2008-01-16 17:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
> there are a handful of 'scheduler feature bits' in
> /proc/sys/kernel/sched_features:
>
> enum {
> SCHED_FEAT_NEW_FAIR_SLEEPERS = 1,
> SCHED_FEAT_WAKEUP_PREEMPT = 2,
> SCHED_FEAT_START_DEBIT = 4,
> SCHED_FEAT_TREE_AVG = 8,
> SCHED_FEAT_APPROX_AVG = 16,
> };
>
Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or
SCHED_FEAT_WAKEUP_PREEMPT to 0 gives me results more in line with my
2.6.22 results. Toggling them both to 0 gives me slightly better
results than 2.6.22!
> /sys/devices/system/cpu/sched_mc_power_savings
>
> does that change the results?
>
no measurable difference on this toggle that I can see.
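For reference, assuming the bitmask semantics of the enum Ingo quoted, the both-bits-off configuration reported above corresponds to (root; CONFIG_SCHED_DEBUG=y):

```shell
# Default on 2.6.24-rc7: NEW_FAIR_SLEEPERS(1) | WAKEUP_PREEMPT(2) | START_DEBIT(4) = 7
cat /proc/sys/kernel/sched_features
# Clear NEW_FAIR_SLEEPERS and WAKEUP_PREEMPT, keep START_DEBIT:
echo 4 > /proc/sys/kernel/sched_features
```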
* Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon
2008-01-16 17:34 ` Colin Fowler
@ 2008-01-18 10:55 ` Ingo Molnar
0 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2008-01-18 10:55 UTC (permalink / raw)
To: Colin Fowler; +Cc: linux-kernel, Peter Zijlstra
* Colin Fowler <elethiomel@gmail.com> wrote:
> > there are a handful of 'scheduler feature bits' in
> > /proc/sys/kernel/sched_features:
> >
> > enum {
> > SCHED_FEAT_NEW_FAIR_SLEEPERS = 1,
> > SCHED_FEAT_WAKEUP_PREEMPT = 2,
> > SCHED_FEAT_START_DEBIT = 4,
> > SCHED_FEAT_TREE_AVG = 8,
> > SCHED_FEAT_APPROX_AVG = 16,
> > };
> >
>
> Toggling SCHED_FEAT_NEW_FAIR_SLEEPERS to 0 or
> SCHED_FEAT_WAKEUP_PREEMPT to 0 gives me results more in line with my
> 2.6.22 results. Toggling them both to 0 gives me slightly better
> results than 2.6.22!
ok, but it would be nice to avoid having to turn these off. Could you
try whether tuning the /proc/sys/kernel/*granularity* values (in
particular wakeup_granularity) has any positive effect on your workload?
also, could you run your workload as SCHED_BATCH [via schedtool -B ],
does that improve the results as well on a default-tuned kernel?
Ingo
Thread overview: 13+ messages
2008-01-14 17:37 Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon Colin Fowler
2008-01-14 18:55 ` Ingo Molnar
2008-01-14 22:42 ` Colin Fowler
2008-01-14 22:43 ` Colin Fowler
2008-01-15 17:01 ` Colin Fowler
2008-01-15 22:06 ` Ingo Molnar
2008-01-15 23:05 ` Colin Fowler
2008-01-16 15:35 ` Ingo Molnar
2008-01-16 16:10 ` Colin Fowler
2008-01-16 16:19 ` Ingo Molnar
2008-01-16 16:38 ` Colin Fowler
2008-01-16 17:34 ` Colin Fowler
2008-01-18 10:55 ` Ingo Molnar