public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Next patches for the 2.6.25 queue
@ 2007-12-13 14:46 Mathieu Desnoyers
  2007-12-13 15:49 ` Adrian Bunk
  2007-12-13 21:32 ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2007-12-13 14:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Ingo Molnar

Hi Andrew,

I would like to post my next patches in a way that makes it as easy as
possible for you and the community to review them. Currently, the
patches that have really settled down are:

* For 2.6.25

- Text Edit Lock
  - Looks good to Ingo Molnar.
- Immediate Values
  - Redux version, asked by Rusty

* For 2.6.25 ?

Another patchset that is technically OK (however, Rusty dislikes the
complexity inherent in the algorithms required to be reentrant wrt NMIs
and MCEs, although it's been reviewed by the community for months). I
have also replied to Ingo's concerns about the efficiency of my approach
compared to DTrace by providing numbers, but he has not replied yet.
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg238317.html

- Markers use Immediate Values

* Maybe for 2.6.26 ...

Once we have this and the instrumentation (submitted as RFCs in the past
weeks) in the kernel, the only architecture-dependent element left will
be the LTTng timestamping code.

From that point on, the following patchset, the LTTng tracer itself, is
mostly self-contained and stops modifying code all over the kernel tree.

Trying to improve my approach: I guess that submitting at most 15
patches at a time (every 1-2 days), against the -mmotm tree, would be
the way to do it?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Next patches for the 2.6.25 queue
  2007-12-13 14:46 Next patches for the 2.6.25 queue Mathieu Desnoyers
@ 2007-12-13 15:49 ` Adrian Bunk
  2007-12-14 16:43   ` Mathieu Desnoyers
  2007-12-13 21:32 ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Adrian Bunk @ 2007-12-13 15:49 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

On Thu, Dec 13, 2007 at 09:46:42AM -0500, Mathieu Desnoyers wrote:
> Hi Andrew,
> 
> I would like to post my next patches in a way that would make it as
> easy for you and the community to review them. Currently, the patches
> that have really settled down are :
> 
> * For 2.6.25
>...
> - Immediate Values
>   - Redux version, asked by Rusty
>...

I might have missed it:

Are there any real numbers (as opposed to estimates and microbenchmarks)
available for how much performance we actually gain, and in which
situations?

It might be some workload with markers using Immediate Values or 
something like that, but it should be something where the kernel
runs measurably faster with Immediate Values than without.

Currently I'm somewhere between "your Immediate Values are just an 
academic code obfuscation without any gain in practice" and "janitors 
should convert all drivers to use Immediate Values", and I'd like to 
form an opinion based on which situations the kernel runs faster in,
and by how many percent.

That's also based on observations such as that __read_mostly should 
improve performance, but I've already seen situations in the kernel 
where it forced gcc to emit code that was obviously both bigger and 
slower than without the __read_mostly [1], and that's part of why I'm 
sceptical of all optimizations below the C level unless proven 
otherwise.

> Thanks,
> 
> Mathieu

cu
Adrian

[1] Figuring out what might have happened is left as an exercise to the 
    reader.  :-)

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Next patches for the 2.6.25 queue
  2007-12-13 14:46 Next patches for the 2.6.25 queue Mathieu Desnoyers
  2007-12-13 15:49 ` Adrian Bunk
@ 2007-12-13 21:32 ` Andrew Morton
  2007-12-13 22:18   ` Mathieu Desnoyers
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-12-13 21:32 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel, mingo

On Thu, 13 Dec 2007 09:46:42 -0500
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Hi Andrew,
> 
> I would like to post my next patches in a way that would make it as
> easy for you and the community to review them. Currently, the patches
> that have really settled down are :
> 
> * For 2.6.25
> 
> - Text Edit Lock
>   - Looks-good-to Ingo Molnar.
> - Immediate Values
>   - Redux version, asked by Rusty
> 
> * For 2.6.25 ?
> 
> Another patchset that is technically ok (however Rusty dislikes the
> complexity inherent to the algorithms required to be reentrant wrt NMI
> and MCE, although it's been reviewed by the community for months). I
> have also replied to Ingo's concerns about the efficiency of my approach
> compared to DTrace by providing numbers, but he has not replied yet.
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg238317.html
> 
> - Markers use Immediate Values
> 
> * Maybe for 2.6.26 ...
> 
> Once we have this, and the instrumentation (submitted as RFC in the past
> weeks), in the kernel, the only architecture dependent element that will
> be left is the LTTng timestamping code.
> 
> And then, from that point, the following patchset is mostly
> self-contained and stops modifying code all over the kernel tree. It
> is the LTTng tracer.
> 
> Trying to improve my approach : I guess that submitting at most 15
> patches at a time (each 1-2 days), against the -mmotm tree, would be the
> way to do it ?
> 

Just for some context, I have...

- 1,400-odd open bugzilla reports

- 719 emails saved away in my emailed-bug-reports folder, all of which
  need to be gone through, asking originators to retest and
  re-report-if-unfixed.

- A big ugly email titled "2.6.24-rc5-git1: Reported regressions from
  2.6.23" in my inbox.

All of which makes it a bit inappropriate to be thinking about
intrusive-looking new features.

Ho hum.  Just send me the whole lot against rc5-mm1 and I'll stick it in
there and we'll see what breaks.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Next patches for the 2.6.25 queue
  2007-12-13 21:32 ` Andrew Morton
@ 2007-12-13 22:18   ` Mathieu Desnoyers
  0 siblings, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2007-12-13 22:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, mingo

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 13 Dec 2007 09:46:42 -0500
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Hi Andrew,
> > 
> > I would like to post my next patches in a way that would make it as
> > easy for you and the community to review them. Currently, the patches
> > that have really settled down are :
> > 
> > * For 2.6.25
> > 
> > - Text Edit Lock
> >   - Looks-good-to Ingo Molnar.
> > - Immediate Values
> >   - Redux version, asked by Rusty
> > 
> > * For 2.6.25 ?
> > 
> > Another patchset that is technically ok (however Rusty dislikes the
> > complexity inherent to the algorithms required to be reentrant wrt NMI
> > and MCE, although it's been reviewed by the community for months). I
> > have also replied to Ingo's concerns about the efficiency of my approach
> > compared to DTrace by providing numbers, but he has not replied yet.
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg238317.html
> > 
> > - Markers use Immediate Values
> > 
> > * Maybe for 2.6.26 ...
> > 
> > Once we have this, and the instrumentation (submitted as RFC in the past
> > weeks), in the kernel, the only architecture dependent element that will
> > be left is the LTTng timestamping code.
> > 
> > And then, from that point, the following patchset is mostly
> > self-contained and stops modifying code all over the kernel tree. It
> > is the LTTng tracer.
> > 
> > Trying to improve my approach : I guess that submitting at most 15
> > patches at a time (each 1-2 days), against the -mmotm tree, would be the
> > way to do it ?
> > 
> 
> Just for some context, I have...
> 
> - 1,400-odd open bugzilla reports
> 
> - 719 emails saved away in my emailed-bug-reports folder, all of which
>   need to be gone through, asking originators to retest and
>   re-report-if-unfixed.
> 
> - A big ugly email titled "2.6.24-rc5-git1: Reported regressions from
>   2.6.23" in my inbox.
> 
> All of which makes it a bit inappropriate to be thinking about
> intrusive-looking new features.
> 
> Ho hum.  Just send me the whole lot against rc5-mm1 and I'll stick it in
> there and we'll see what breaks.
> 

Ok,

Hmm, the "whole lot": does that include or exclude
- the instrumentation
- the LTTng tracer?

I realise that you have a lot of other things on your mind. One of my
goals is to get LTTng into the -mm tree so it could help you resolve
these bugs. But on the other hand, I don't want to rush things. The
LTTng tracer could benefit from another round of RFC before it is ready
for prime time, but it would definitely be useful as-is in the -mm tree.

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Next patches for the 2.6.25 queue
  2007-12-13 15:49 ` Adrian Bunk
@ 2007-12-14 16:43   ` Mathieu Desnoyers
  2007-12-15 22:14     ` Adrian Bunk
  0 siblings, 1 reply; 6+ messages in thread
From: Mathieu Desnoyers @ 2007-12-14 16:43 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

* Adrian Bunk (bunk@kernel.org) wrote:
> On Thu, Dec 13, 2007 at 09:46:42AM -0500, Mathieu Desnoyers wrote:
> > Hi Andrew,
> > 
> > I would like to post my next patches in a way that would make it as
> > easy for you and the community to review them. Currently, the patches
> > that have really settled down are :
> > 
> > * For 2.6.25
> >...
> > - Immediate Values
> >   - Redux version, asked by Rusty
> >...
> 
> I might have missed it:
> 
> Are there any real numbers (as opposed to estimates and microbenchmarks)
> available for how much performance we actually gain, and in which situations?
> 
> It might be some workload with markers using Immediate Values or 
> something like that, but it should be something where the kernel
> runs measurably faster with Immediate Values than without.
> 
> Currently I'm somewhere between "your Immediate Values are just an 
> academic code obfuscation without any gain in practice" and "janitors 
> should convert all drivers to use Immediate Values", and I'd like to 
> form an opinion based on in which situations the kernel runs faster by 
> how many percent.
> 
> That's also based on observation like e.g. that __read_mostly should 
> improve the performance, but I've already seen situations in the kernel 
> where it forced gcc to emit code that was obviously both bigger and 
> slower than without the __read_mostly [1], and that's part of why I'm 
> sceptical of all optimizations below the C level unless proven 
> otherwise.
> 

Hi Adrian,

Yes, I had numbers that were presented in the patch headers, but I
re-ran some tests to get a clearer picture. Actually, what makes this
difficult to benchmark is the measurement error caused by the system's
"background noise" (interrupts, softirqs, kernel threads...). Note that
we are measuring cache effects and, therefore, any program which does
the same operation many times in a loop will benefit from spatial and
temporal locality and won't trigger many cache misses after the first
loop.

So, here is what I have done to get a significant difference between the
cases with and without immediate values:

I ran, in userspace, a program that does random memory accesses
(3 times, in a 10MB array) between each getppid() syscall, everything
wrapped in a loop repeated 1000 times (enough that the results are
reproducible between runs). Tests were done on a 3GHz Pentium 4 with
2GB of RAM running Linux 2.6.24-rc5.

I instrumented getppid() with 40 markers, so the impact of the memory
reads won't be buried in the "background noise". Since each marker uses
a 24-byte structure (8-byte aligned), and the structures are next to
each other in memory, we will cause (depending on the alignment of the
structures in the cache lines):

L1 cache lines: 64 bytes
L2 cache lines: 128 bytes

8-9 memory reads (L2 cache misses)
15-16 L2 accesses (L1 cache misses)

for each getppid() syscall.
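The 8-9 and 15-16 figures follow from the 960 contiguous bytes of marker state (40 x 24 bytes) and the P4's line sizes; a quick sanity check of that arithmetic, assuming the structures are contiguous:

```c
#include <assert.h>

/* Cache lines touched by a span of `size` bytes at byte offset `off`. */
static int lines_touched(int off, int size, int line)
{
    return (off + size - 1) / line - off / line + 1;
}

static void check_marker_footprint(void)
{
    int span = 40 * 24;  /* 40 markers x 24-byte structs = 960 bytes */

    /* 128-byte L2 lines: 960/128 = 7.5, so 8 lines when the span is
     * line-aligned, 9 when it straddles an extra line. */
    assert(lines_touched(0, span, 128) == 8);
    assert(lines_touched(72, span, 128) == 9);  /* 8-byte-aligned start */

    /* 64-byte L1 lines: 960/64 = 15 exactly, so 15 aligned, 16 not. */
    assert(lines_touched(0, span, 64) == 15);
    assert(lines_touched(8, span, 64) == 16);
}
```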

The result is as expected:

Number of cycles for getppid

* Without memory pressure: 1470 cycles
* With memory pressure (std. dev. calculated over 3 groups of 1000 loops
                        on the compiled-out case: 416.54 cycles):
  * 40 markers without immediate values: 14938 cycles
  * 40 markers with immediate values:    12795 cycles
  * Markers compiled out:                12427 cycles

for a 14% speedup achieved by using immediate values for the data reads.
There seems to be no significant difference between compiling out the
markers and using immediate values to disable them.
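Both conclusions follow directly from the table above; as a quick arithmetic check (the cycle counts below are the measured values quoted in the table):

```c
#include <assert.h>

/* Cycle counts from the table above. */
#define CYCLES_NO_IMV 14938.0  /* 40 markers, no immediate values     */
#define CYCLES_IMV    12795.0  /* 40 markers, immediate values        */
#define CYCLES_NONE   12427.0  /* markers compiled out                */
#define STDDEV          416.54 /* std. dev. of the compiled-out runs  */

/* (14938 - 12795) / 14938 = ~14.3%, quoted as "14%". */
static double speedup_percent(void)
{
    return (CYCLES_NO_IMV - CYCLES_IMV) / CYCLES_NO_IMV * 100.0;
}

/* Disarmed immediate-value markers vs. compiled-out markers:
 * 368 cycles apart, i.e. within one standard deviation, hence
 * no statistically significant difference. */
static double residual_cost(void)
{
    return CYCLES_IMV - CYCLES_NONE;
}
```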

Note that since the markers are located in the same cache lines, those
40 markers are equivalent to about 8 markers _not_ sharing cache lines
(which, in real life, is the much more likely layout).

So, the conditions for a speedup here are:

- A significant number of cache lines must be saved.
- They must be read from memory often.

So, we will likely see a real-life impact in situations such as
instrumenting spinlocks: whenever they are taken/released many times in
a system call made by an application doing random memory accesses
(a hash-based search engine would be a good example; a database would
also be a suitable workload), we should be able to measure the impact.
However, this is hard to reproduce/measure, which is why I created a
synthetic workload simulating this behavior.

So I would really suggest using immediate values for applications such
as:
- code markup (the markers)
- dynamically enabling what would otherwise have been selected in
  menuconfig (such as profiling, or scheduler/timer statistics for
  powertop...)

where the goal is to have _zero_ measurable performance impact on any
workload.

Mathieu

> > Thanks,
> > 
> > Mathieu
> 
> cu
> Adrian
> 
> [1] Figuring out what might have happened is left as an exercise to the 
>     reader.  :-)
> 
> -- 
> 
>        "Is there not promise of rain?" Ling Tan asked suddenly out
>         of the darkness. There had been need of rain for many days.
>        "Only a promise," Lao Er said.
>                                        Pearl S. Buck - Dragon Seed
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Next patches for the 2.6.25 queue
  2007-12-14 16:43   ` Mathieu Desnoyers
@ 2007-12-15 22:14     ` Adrian Bunk
  0 siblings, 0 replies; 6+ messages in thread
From: Adrian Bunk @ 2007-12-15 22:14 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andrew Morton, linux-kernel, Ingo Molnar

On Fri, Dec 14, 2007 at 11:43:35AM -0500, Mathieu Desnoyers wrote:
> * Adrian Bunk (bunk@kernel.org) wrote:
> > On Thu, Dec 13, 2007 at 09:46:42AM -0500, Mathieu Desnoyers wrote:
> > > Hi Andrew,
> > > 
> > > I would like to post my next patches in a way that would make it as
> > > easy for you and the community to review them. Currently, the patches
> > > that have really settled down are :
> > > 
> > > * For 2.6.25
> > >...
> > > - Immediate Values
> > >   - Redux version, asked by Rusty
> > >...
> > 
> > I might have missed it:
> > 
> > Are there any real numbers (as opposed to estimates and microbenchmarks)
> > available for how much performance we actually gain, and in which situations?
> > 
> > It might be some workload with markers using Immediate Values or 
> > something like that, but it should be something where the kernel
> > runs measurably faster with Immediate Values than without.
> > 
> > Currently I'm somewhere between "your Immediate Values are just an 
> > academic code obfuscation without any gain in practice" and "janitors 
> > should convert all drivers to use Immediate Values", and I'd like to 
> > form an opinion based on in which situations the kernel runs faster by 
> > how many percent.
> > 
> > That's also based on observation like e.g. that __read_mostly should 
> > improve the performance, but I've already seen situations in the kernel 
> > where it forced gcc to emit code that was obviously both bigger and 
> > slower than without the __read_mostly [1], and that's part of why I'm 
> > sceptical of all optimizations below the C level unless proven 
> > otherwise.
> > 
> 
> Hi Adrian,

Hi Mathieu,

> Yes, I had numbers that were presented in the patch headers, but I
> re-ran some tests to have a clearer picture. Actually, what makes this
> difficult to benchmark is the measurement error caused by the system's
> "background noise" (interrupts, softirqs, kernel threads...). Note that
> we are measuring cache effects and, therefore, any program which does
> the same operation many times in a loop will benefit from space and time
> locality and won't trigger many cache misses after the first loop.
> 
> So, here is what I have done to get a significant difference between the
> with and without immediate values :
> 
> I ran, in userspace, a program that does random memory access
> (3 times, in a 10MB array) between each getppid() syscall, everything
> wrapped in a loop, repeated 1000 times (enough so the results are
> reproducible between runs). Tests were done on a 3GHz Pentium 4 with
> 2GB of ram with Linux 2.6.24-rc5.

gcc version?

> I instrumented getppid() with 40 markers, so the impact of the memory reads
> won't be buried in the "background noise". Since each marker uses
> a 24-byte structure (8-byte aligned), and the structures are next to each
> other in memory, we will cause (depending on the alignment of the
> structures in the cache lines):
> 
> L1 cache lines : 64 bytes
> L2 cache lines : 128 bytes
> 
> 8-9 memory reads (L2 cache misses)
> 15-16 L2 accesses (L1 cache misses)
> 
> for each getppid() syscall.
> 
> The result is as expected :
> 
> Number of cycles for getppid
> 
> * Without memory pressure : 1470 cycles
> * With memory pressure (std. dev. calculated on 3 groups of 1000 loops on
>                         compiled out case : 416.54 cycles)
>   * 40 markers without immediate values : 14938 cycles
>   * 40 markers with immediate values :    12795 cycles
>   * Markers compiled out :                12427 cycles
> 
> for a 14% speedup achieved by using immediate values for the data reads.
> There seems to be no significant difference between compiling out the
> markers and using immediate values to disable them.

OK, it's good to see that there are situations where we can measure the 
benefits.

> Note that since the markers are located in the same cache lines, those
> 40 markers are the equivalent to have about 8 markers _not_ on the same
> cache lines (in real life, that's very likely to be the case).
> 
> So, the conditions to have a speedup here :
> 
> - A significant amount of cache lines must be saved.
> - They must be read from memory often.
> 
> So, we will likely see a real-life impact in situations such as :
> instrumenting spinlocks; whenever they would be taken/released many
> times in a system call made by an application doing random memory access
> (a hash-based search engine would be a good example, a database would
> also be a suitable workload), we should be able to measure the impact.
> However, this is hard to reproduce/measure, so this is why I created a
> synthetic workload simulating this behavior.
> 
> So I would really suggest using the immediate values for applications
> such as :
> - code markup (the markers)
> - dynamically enable what would have otherwise been selected in
>   menuconfig (such as profiling, scheduler/timer statistics for
>   powertop...)
> 
> where the goal is to have _zero_ measurable impact on performance on any
> workload.

Are the effects of your immediate values on gcc optimizations really 
well enough researched to be able to make this claim of
"zero measurable impact on performance on any workload"?

It should be quite easy to produce an artificial example where it has a 
huge influence on performance, and proving that this will never ever 
happen in practice sounds like a hard task.

> Mathieu

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-12-15 22:14 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-13 14:46 Next patches for the 2.6.25 queue Mathieu Desnoyers
2007-12-13 15:49 ` Adrian Bunk
2007-12-14 16:43   ` Mathieu Desnoyers
2007-12-15 22:14     ` Adrian Bunk
2007-12-13 21:32 ` Andrew Morton
2007-12-13 22:18   ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox