* Hyperthreading performance oddities
From: belcampo @ 2008-02-22 9:36 UTC
To: linux-kernel

Hi all,

I would like to be personally CC'ed on answers/comments posted to the
list in response to my posting.

I have the following CPU:

vendor_id  : GenuineIntel
cpu family : 15
model      : 4
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping   : 1
cpu MHz    : 3000.000
cache size : 1024 KB

On 2.6.17 everything works as I expect. My Mandriva distribution ships a
2.6.22.9 kernel by default, which performed badly (numbers follow). I
installed a fresh 2.6.24.2 kernel, which also performed badly, so I went
back to 2.6.17 and everything works OK.

I have some benchmarks from mplayer:

Kernel 2.6.22.9 smp hyperthreading
BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s
Kernel 2.6.22.9 nonsmp/hyperthreading
BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s
with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU
BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s
with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel
BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s

The 2.6.24.2 kernel had the same results as the 2.6.22.9 version.

Regards
Henk Schoneveld
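Taking the totals at face value, the relative slowdowns are easy to work out from the figures reported above (a quick check; the numbers are copied verbatim from the BENCHMARKs lines):

```python
# Totals (seconds) copied from the BENCHMARKs lines in the report above.
totals = {
    "2.6.22.9 smp/ht":           338.143,
    "2.6.22.9 nonsmp":           265.567,
    "2.6.17 smp/ht pentium-pro": 247.704,
    "2.6.17 smp/ht pentium4":    230.594,
}

# Use the fastest configuration as the baseline.
baseline = totals["2.6.17 smp/ht pentium4"]
for kernel, total in totals.items():
    slowdown = (total / baseline - 1) * 100
    print(f"{kernel}: {total:.3f}s ({slowdown:+.1f}% vs fastest)")
```

So the 2.6.22.9 SMP/HT run is roughly 47% slower than the best 2.6.17 run on the same job, which is the regression being reported.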
* Re: Hyperthreading performance oddities 2008-02-22 9:36 Hyperthreading performance oddities belcampo @ 2008-02-22 10:06 ` Frederik Deweerdt 2008-03-07 13:37 ` Andrew Buehler 0 siblings, 1 reply; 10+ messages in thread From: Frederik Deweerdt @ 2008-02-22 10:06 UTC (permalink / raw) To: belcampo; +Cc: linux-kernel Hello Henk, On Fri, Feb 22, 2008 at 10:36:01AM +0100, belcampo wrote: > Kernel 2.6.22.9 smp hyperthreading > BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s > Kernel 2.6.22.9 nonsmp/hyperthreading > BENCHMARKs: VC: 262.008s VO: 0.031s A: 0.000s Sys: 3.528s = 265.567s > with 2.6.17 kernel smp/hyperthreading pentium-pro as CPU > BENCHMARKs: VC: 245.175s VO: 0.050s A: 0.000s Sys: 2.479s = 247.704s > with 2.6.17 kernel smp/hyperthreading pentium4 optimized kernel > BENCHMARKs: VC: 227.992s VO: 0.051s A: 0.000s Sys: 2.551s = 230.594s I'm not familiar with mplayer benchmarks, what do they actually measure? Regards, Frederik ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Hyperthreading performance oddities
From: Andrew Buehler @ 2008-03-07 13:37 UTC
To: Frederik Deweerdt; Cc: belcampo, linux-kernel

(I'm aware that this could be considered thread necromancy, but I haven't
yet seen any indication that that is considered a bad thing in these here
parts; if it is, then I apologize, and upon being informed of the fact
will undertake not to commit such again.)

On 2/22/2008 5:06 AM, Frederik Deweerdt wrote:
> I'm not familiar with mplayer benchmarks; what do they actually
> measure?

I don't know if this discussion got continued privately, but on the
assumption that it didn't, I think I can give at least a basic answer.

The VC: value is the amount of time spent in the video-codec code during
that run, and the VO: value is the amount of time spent in the
video-output code. The A: value is, as I recall, the amount of time spent
in audio processing - though whether that means the codec, the audio
output, the audio filters, etc. is unclear; I remember there being
separate values for those rather than their being lumped under one
header. The Sys: value is, I believe, the amount of time spent in system
calls.

(For the record: I'm a long-time lurker and occasional, largely non-code,
contributor on the MPlayer development lists, but I've never had occasion
to look at the code behind, or the logic involved in, the -benchmark
output.)

-- 
Andrew Buehler
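Given that breakdown, the per-stage values should add up to the printed total. A short sketch that parses a BENCHMARKs line and checks this — the format is inferred from the lines quoted in this thread, not taken from MPlayer's source:

```python
import re

def parse_benchmarks(line):
    """Split 'VC: 334.042s VO: 0.053s ... = 338.143s' into per-stage seconds and the total."""
    # Each stage looks like 'NAME: 1.234s'; the grand total follows '= '.
    fields = {k: float(v) for k, v in re.findall(r"(\w+):\s*([\d.]+)s", line)}
    total = float(re.search(r"=\s*([\d.]+)s", line).group(1))
    return fields, total

line = "BENCHMARKs: VC: 334.042s VO: 0.053s A: 0.000s Sys: 4.049s = 338.143s"
fields, total = parse_benchmarks(line)
print(fields)  # {'VC': 334.042, 'VO': 0.053, 'A': 0.0, 'Sys': 4.049}

# The stages account for the whole run, modulo rounding in the last digit.
assert abs(sum(fields.values()) - total) < 0.01
```

The check holds for every BENCHMARKs line in this thread, which supports the reading that VC/VO/A/Sys is an exhaustive breakdown of the run time.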
* Re: Hyperthreading performance oddities
From: Chris Snook @ 2008-03-07 19:08 UTC
To: Andrew Buehler; Cc: Frederik Deweerdt, belcampo, linux-kernel

Andrew Buehler wrote:
> The VC: value is the amount of time spent in the video-codec code during
> that run, and the VO: value is the amount of time spent in the
> video-output code. The A: value is, as I recall, the amount of time
> spent in audio processing, and the Sys: value is, I believe, the amount
> of time spent in system calls.

Turning on hyperthreading effectively halves the amount of cache
available to each logical CPU when both are doing work, which can do more
harm than good. Number-crunching applications that use the cache
effectively generally don't benefit from hyperthreading, particularly
floating-point-intensive ones. On the other hand, hyperthreading is
excellent for streaming integer work, like compiling. Whether or not you
should use it depends entirely on your workload.

-- 
Chris
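The cache-halving argument can be caricatured with a deliberately crude toy model (the numbers are illustrative only and do not model any real P4): a thread whose working set fits in its cache share runs at full speed, while one that spills to memory runs at some miss-penalized fraction.

```python
def throughput(working_set_kb, cache_kb, hit_speed=1.0, miss_speed=0.3):
    """Toy model: full speed if the working set fits in cache, degraded if not."""
    return hit_speed if working_set_kb <= cache_kb else miss_speed

CACHE_KB = 1024  # the L2 size from the original report
ws = 700         # hypothetical working set: fits in 1024 KB, not in 512 KB

single = throughput(ws, CACHE_KB)             # one thread owning the whole cache
ht_pair = 2 * throughput(ws, CACHE_KB // 2)   # two HT siblings, half the cache each

print(f"single thread: {single:.2f}, HT pair combined: {ht_pair:.2f}")
```

In this regime the two siblings together get less done than one thread alone did, which is the "more harm than good" case; shrink the working set below half the cache and HT wins instead.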
* Re: Hyperthreading performance oddities
From: Andi Kleen @ 2008-03-07 19:20 UTC
To: Chris Snook; Cc: Andrew Buehler, Frederik Deweerdt, belcampo, linux-kernel

Chris Snook <csnook@redhat.com> writes:
>
> Turning on hyperthreading effectively halves the amount of cache
> available for each logical CPU when both are doing work, which can do
> more harm than good.

When the two siblings are in the same address space (as in being two
threads of the same process), the L1 cache will be shared on the P4. I
think for the other cases the cache management is also a little more
sophisticated than a simple split, depending on which HT generation
you're talking about (Intel had at least 4 generations out, each with
improvements over the earlier ones).

BTW, your argument would in theory also apply to multi-core with shared
L2 or L3, but even there the CPUs tend to be more sophisticated. E.g.
Core2 has a mechanism called "adaptive cache" which allows one core to
use significantly more of the L2 in some cases.

> Number-crunching applications that utilize the
> cache effectively generally don't benefit from hyperthreading,
> particularly floating-point-intensive ones.

That sounds like far too broad an overgeneralization to me.

-Andi (who personally always liked HT)
* Re: Hyperthreading performance oddities
From: belcampo @ 2008-03-08 7:12 UTC
To: Andi Kleen; Cc: Chris Snook, Andrew Buehler, Frederik Deweerdt, linux-kernel

Hi all,

Back to basics: the 2.6.22.9 smp/hyperthreading kernel needs 338.143s
while the 2.6.17 smp/hyperthreading kernel needs 247.704s for exactly the
same job on the same machine. For me it's not about HT vs. non-HT.

Henk Schoneveld
* Re: Hyperthreading performance oddities
From: Willy Tarreau @ 2008-03-08 7:30 UTC
To: Andi Kleen
Cc: Chris Snook, Andrew Buehler, Frederik Deweerdt, belcampo, linux-kernel

Hi Andi,

On Fri, Mar 07, 2008 at 08:20:32PM +0100, Andi Kleen wrote:
> Chris Snook <csnook@redhat.com> writes:
> >
> > Turning on hyperthreading effectively halves the amount of cache
> > available for each logical CPU when both are doing work, which can do
> > more harm than good.
>
> When the two siblings are in the same address space (as in being two
> threads of the same process), the L1 cache will be shared on the P4. I
> think for the other cases the cache management is also a little more
> sophisticated than a simple split, depending on which HT generation
> you're talking about (Intel had at least 4 generations out, each with
> improvements over the earlier ones).

Oh, that's quite interesting to know.

> BTW, your argument would in theory also apply to multi-core with shared
> L2 or L3, but even there the CPUs tend to be more sophisticated. E.g.
> Core2 has a mechanism called "adaptive cache" which allows one core to
> use significantly more of the L2 in some cases.
>
> > Number-crunching applications that utilize the
> > cache effectively generally don't benefit from hyperthreading,
> > particularly floating-point-intensive ones.
>
> That sounds like far too broad an overgeneralization to me.
>
> -Andi (who personally always liked HT)

Well, in my experience, except for compiling, HT has always caused
massive slowdowns, especially on network-intensive applications.
Basically, network performance took a 20-30% hit, while compiling got a
20-30% boost. But I must admit that I never tried HT on anything more
recent than a P4; maybe things have changed since.

Regards,
Willy
* Re: Hyperthreading performance oddities
From: Andi Kleen @ 2008-03-08 11:46 UTC
To: Willy Tarreau
Cc: Chris Snook, Andrew Buehler, Frederik Deweerdt, belcampo, linux-kernel

> Well, in my experience, except for compiling, HT has always caused
> massive slowdowns, especially on network-intensive applications.
> Basically, network performance took a 20-30% hit, while compiling got a

What network workload? Networking tends to have a lot of cache misses,
and unless you're exceeding your memory bandwidth, HT normally does well
on such workloads because it can do other things while the CPU is waiting
for loads.

> 20-30% boost. But I must admit that I never tried HT on anything
> more recent than a P4; maybe things have changed since.

There's nothing more recent out yet (unless you're talking non-x86), but
there were many different P4 generations. In particular, Prescott (90nm)
was quite different from the earlier ones, but even before and after
there were some improvements and changes.

-Andi
* Re: Hyperthreading performance oddities
From: Willy Tarreau @ 2008-03-08 12:34 UTC
To: Andi Kleen
Cc: Chris Snook, Andrew Buehler, Frederik Deweerdt, belcampo, linux-kernel

On Sat, Mar 08, 2008 at 12:46:55PM +0100, Andi Kleen wrote:
> > Well, in my experience, except for compiling, HT has always caused
> > massive slowdowns, especially on network-intensive applications.
> > Basically, network performance took a 20-30% hit, while compiling got a
>
> What network workload?

High-session-rate HTTP traffic. That means high packet rates, high
session lookup rates, etc.

> Networking tends to have a lot of cache misses, and unless you're
> exceeding your memory bandwidth, HT normally does well on such
> workloads because it can do other things while the CPU is waiting for
> loads.

On SMP, the load is generally divided with user space on one CPU and IRQs
on the other, though not well balanced (less IRQ), which means that SMP
is rarely more than 50-60% faster than UP. On HT, I normally observe
lower performance than on UP.

> > 20-30% boost. But I must admit that I never tried HT on anything
> > more recent than a P4; maybe things have changed since.
>
> There's nothing more recent out yet (unless you're talking non-x86),
> but there were many different P4 generations. In particular, Prescott
> (90nm) was quite different from the earlier ones, but even before and
> after there were some improvements and changes.

OK. Amusingly, the HT flag is present on my C2D E8200:

model name : Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
             cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
             pbe nx lm constant_tsc pni monitor ds_cpl vmx smx est tm2
             ssse3 cx16 xtpr lahf_lm

Cheers,
Willy
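The flags line quoted above can be checked mechanically. Note that the ht bit by itself says nothing about whether real hyperthreading siblings exist, which is exactly the fake-HT situation on dual-core parts discussed in the follow-up:

```python
# The flags field from the E8200's /proc/cpuinfo, as quoted in the message above.
cpuinfo_flags = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat "
    "pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm "
    "constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm"
)

flags = set(cpuinfo_flags.split())
print("ht" in flags)  # True, even though this Core 2 Duo has no hyperthreading
```

On Linux the more reliable check is the sibling topology exported under sysfs (e.g. /sys/devices/system/cpu/cpu0/topology/thread_siblings_list): if each physical core lists only itself as a thread sibling, the ht flag is just the faked CPUID bit.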
* Re: Hyperthreading performance oddities
From: Andi Kleen @ 2008-03-08 12:43 UTC
To: Willy Tarreau
Cc: Chris Snook, Andrew Buehler, Frederik Deweerdt, belcampo, linux-kernel

> On HT, I normally observe lower performance than on UP.

Hmm, weird. It might be interesting to investigate in detail what is
going on there.

> model name : Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm
> constant_tsc pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm

Dual-core systems generally have it. It leads to better scheduling on
some older OSes, because in many respects dual core is nearer to HT than
to a true dual-socket system. There was no traditional way to express
"core siblings" in CPUID, so they just faked HT again, but added some
additional ways to detect real dual-coreness. AMD does it similarly (but
slightly differently). Of course, modern kernels don't need such hacks
anymore.

-Andi
End of thread, newest message 2008-03-08 12:43 UTC. Thread overview (10 messages):

2008-02-22  9:36 Hyperthreading performance oddities  belcampo
2008-02-22 10:06 ` Frederik Deweerdt
2008-03-07 13:37   ` Andrew Buehler
2008-03-07 19:08     ` Chris Snook
2008-03-07 19:20       ` Andi Kleen
2008-03-08  7:12         ` belcampo
2008-03-08  7:30         ` Willy Tarreau
2008-03-08 11:46           ` Andi Kleen
2008-03-08 12:34             ` Willy Tarreau
2008-03-08 12:43               ` Andi Kleen