* lmbench lat_mmap slowdown with CONFIG_PARAVIRT
@ 2009-01-20 11:05 Nick Piggin
  2009-01-20 11:26 ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 11:05 UTC (permalink / raw)
To: Linux Kernel Mailing List, Ingo Molnar, Linus Torvalds, hpa, jeremy,
    chrisw, zach, rusty

Hi,

I'm looking at regressions since 2.6.16, and one is that lat_mmap has
slowed down. On further investigation, a large part of this is not a
_regression_ as such, but the introduction of CONFIG_PARAVIRT=y.

Now, it is true that lat_mmap is basically a microbenchmark, but it
exercises the memory mapping and page fault handler paths, so we're
talking about pretty important paths here. So I think it should be of
interest.

I've run the tests on a 2s8c AMD Barcelona system, binding the test to
CPU0, and running 100 times (stddev is a bit hard to bring down, and my
scripts needed 100 runs in order to pick up much smaller changes in the
results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
problem).

Times I believe are in nanoseconds for lmbench; anyway, lower is better.

  non pv    AVG=464.22  STD=5.56
  paravirt  AVG=502.87  STD=7.36

Nearly 10% performance drop here, which is quite a bit...
hopefully people are testing the speed of their PV implementations
against non-PV bare metal :)

CPU: AMD64 family10, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 10000

BASE:
samples   %        symbol name
49749     32.6336  read_tsc
10151      6.6587  __up_read
 7363      4.8299  unmap_vmas
 7072      4.6390  mnt_drop_write
 4107      2.6941  do_page_fault
 4090      2.6829  rb_get_reader_page
 3601      2.3621  apic_timer_interrupt
 3537      2.3202  set_page_dirty
 3435      2.2532  mnt_want_write
 3302      2.1660  default_idle
 2990      1.9613  idle_cpu
 2904      1.9049  file_update_time
 2789      1.8295  set_page_dirty_balance
 2712      1.7790  retint_swapgs
 2455      1.6104  __do_fault
 2231      1.4635  release_pages
 1989      1.3047  rb_buffer_peek
 1895      1.2431  ring_buffer_consume
 1572      1.0312  handle_mm_fault
 1554      1.0194  put_page
 1461      0.9584  sync_buffer
 1196      0.7845  clear_page_c
 1145      0.7511  rb_advance_reader
 1144      0.7504  hweight64
 1084      0.7111  getnstimeofday
 1076      0.7058  __set_page_dirty_no_writeback
 1020      0.6691  mark_page_accessed
  751      0.4926  tick_do_update_jiffies64

CONFIG_PARAVIRT:
samples   %        symbol name
 8924      7.8849  native_safe_halt
 8823      7.7957  native_read_tsc
 6201      5.4790  default_spin_lock_flags
 5806      5.1300  unmap_vmas
 3996      3.5307  handle_mm_fault
 3954      3.4936  rb_get_reader_page
 3752      3.3151  __do_fault
 2908      2.5694  getnstimeofday
 2303      2.0348  apic_timer_interrupt
 2183      1.9288  find_busiest_group
 2057      1.8175  do_page_fault
 2057      1.8175  hweight64
 2017      1.7821  set_page_dirty
 1926      1.7017  get_next_timer_interrupt
 1781      1.5736  release_pages
 1702      1.5038  native_pte_val
 1620      1.4314  native_sched_clock
 1588      1.4031  rebalance_domains
 1558      1.3766  run_timer_softirq
 1531      1.3527  __down_read_trylock
 1505      1.3298  native_pmd_val
 1445      1.2767  find_get_page
 1369      1.2096  find_next_bit
 1356      1.1981  __ticket_spin_lock
 1225      1.0824  shmem_getpage
 1169      1.0329  radix_tree_lookup_slot
 1166      1.0302  vm_normal_page
  987      0.8721  scheduler_tick
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:05 lmbench lat_mmap slowdown with CONFIG_PARAVIRT Nick Piggin
@ 2009-01-20 11:26 ` Ingo Molnar
  2009-01-20 12:34   ` Nick Piggin
  ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 11:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

* Nick Piggin <npiggin@suse.de> wrote:

> Hi,
>
> I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> down. On further investigation, a large part of this is not due to a
> _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
>
> Now, it is true that lat_mmap is basically a microbenchmark, however it
> is exercising the memory mapping and page fault handler paths, so we're
> talking about pretty important paths here. So I think it should be of
> interest.
>
> I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> scripts needed 100 runs in order to pick up much smaller changes in the
> results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> problem).
>
> Times I believe are in nanoseconds for lmbench, anyway lower is better.
>
>   non pv    AVG=464.22  STD=5.56
>   paravirt  AVG=502.87  STD=7.36
>
> Nearly 10% performance drop here, which is quite a bit... hopefully
> people are testing the speed of their PV implementations against non-PV
> bare metal :)

Ouch, that looks unacceptably expensive. All the major distros turn
CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
promise to have no measurable runtime overhead.

( And i suspect the real life mmap cost is probably even more expensive,
  as on a Barcelona all of lmbench fits into the cache hence we dont see
  any real $cache overhead. )

Jeremy, any ideas where this slowdown comes from and how it could be
fixed?
	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
@ 2009-01-20 12:34   ` Nick Piggin
  2009-01-20 12:45     ` Ingo Molnar
  2009-01-20 14:03   ` Ingo Molnar
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 12:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > [...]
> >
> >   non pv    AVG=464.22  STD=5.56
> >   paravirt  AVG=502.87  STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against non-PV
> > bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>   any real $cache overhead. )

The PV kernel has over 100K larger text size, nearly 40K alone in mm/
and kernel/. Definitely we don't see the worst of the icache or branch
buffer overhead on this microbenchmark. (wow, that's a nasty amount of
bloat :( )

> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

I had a bit of a poke around the profiles, but nothing stood out.
However oprofile counted 50% more cycles in the kernel with PV than
with non-PV. I'll have to take a look at the user/system times, because
50% seems ludicrous... hopefully it's just oprofile noise.
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 12:34 ` Nick Piggin
@ 2009-01-20 12:45   ` Ingo Molnar
  2009-01-20 13:41     ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 12:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

* Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
> > [...]
>
> The PV kernel has over 100K larger text size, nearly 40K alone in mm/
> and kernel/. Definitely we don't see the worst of the icache or branch
> buffer overhead on this microbenchmark. (wow, that's a nasty amount of
> bloat :( )
>
> I had a bit of a poke around the profiles, but nothing stood out.
> However oprofile counted 50% more cycles in the kernel with PV than
> with non-PV. I'll have to take a look at the user/system times, because
> 50% seems ludicrous... hopefully it's just oprofile noise.

If you have a Core2 test-system could you please try tip/master, which
also has your do_page_fault-de-bloating patch applied?

<plug>

The other advantage of tip/master would be that you could try precise
performance counter measurements via:

  http://redhat.com/~mingo/perfcounters/timec.c

and split out the lmbench test-case into a standalone .c file loop.
Running it as:

  $ taskset 0 ./timec -e -5,-4,-3,0,1,2,3 ./mmap-test

Will give you very precise information about what's going on in that
workload:

 Performance counter stats for 'mmap-test':

   628315.871980  task clock ticks  (msecs)

           42330  CPU migrations    (events)
          124980  context switches  (events)
        18698292  pagefaults        (events)
   1351875946010  CPU cycles        (events)
   1121901478363  instructions      (events)
     10654788968  cache references  (events)
       633581867  cache misses      (events)

You might also want to try an NMI profile via kerneltop:

  http://redhat.com/~mingo/perfcounters/kerneltop.c

just run it with no arguments on a perfcounters kernel and it will give
you something like:

------------------------------------------------------------------------------
 KernelTop: 20297 irqs/sec [NMI, 10000 cache-misses], (all, 8 CPUs)
------------------------------------------------------------------------------

   events             RIP        kernel function
  _______   _________________    ________________

 12816.00 - ffffffff803d5760 : copy_user_generic_string!
 11751.00 - ffffffff80647a2c : unix_stream_recvmsg
 10215.00 - ffffffff805eda5f : sock_alloc_send_skb
  9738.00 - ffffffff80284821 : flush_free_list
  6749.00 - ffffffff802854a1 : __kmalloc_track_caller
  3663.00 - ffffffff805f09fa : skb_dequeue
  3591.00 - ffffffff80284be2 : kmem_cache_alloc [qla2xxx]
  3501.00 - ffffffff805f15f5 : __alloc_skb
  1296.00 - ffffffff803d8eb4 : list_del [qla2xxx]
  1110.00 - ffffffff805f0ed2 : kfree_skb

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 12:45 ` Ingo Molnar
@ 2009-01-20 13:41   ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 13:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

On Tue, Jan 20, 2009 at 01:45:00PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
> > [...]
> >
> > I had a bit of a poke around the profiles, but nothing stood out.
> > However oprofile counted 50% more cycles in the kernel with PV than
> > with non-PV. I'll have to take a look at the user/system times,
> > because 50% seems ludicrous... hopefully it's just oprofile noise.

kbuild costs go up a bit (average of 30 builds):

  elapsed  non-pv: AVG=53.31s   STD=0.99
           pv:     AVG=53.54s   STD=0.94
  user     non-pv: AVG=318.63s  STD=0.19
           pv:     AVG=319.33s  STD=0.23
  system   non-pv: AVG=30.56s   STD=0.15
           pv:     AVG=31.80s   STD=0.15

The kernel side of the kbuild workload slows down by 4.1%. User time
also increases a bit (probably more cache and branch misses).

> If you have a Core2 test-system could you please try tip/master, which
> also has your do_page_fault-de-bloating patch applied?

Will try to get one to do some runs on.

Thanks,
Nick
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
  2009-01-20 12:34   ` Nick Piggin
@ 2009-01-20 14:03   ` Ingo Molnar
  2009-01-20 14:14     ` Nick Piggin
  ` (3 more replies)
  2009-01-20 19:05   ` Zachary Amsden
  2009-01-22 22:26   ` Jeremy Fitzhardinge
  3 siblings, 4 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 14:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> > Times I believe are in nanoseconds for lmbench, anyway lower is
> > better.
> >
> >   non pv    AVG=464.22  STD=5.56
> >   paravirt  AVG=502.87  STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against
> > non-PV bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.

Here are some more precise stats done via hw counters on a perfcounters
kernel using 'timec', running a modified version of the 'mmap
performance stress-test' app i made years ago.

The MM benchmark app can be downloaded from:

  http://redhat.com/~mingo/misc/mmap-perf.c

timec.c can be picked up from:

  http://redhat.com/~mingo/perfcounters/timec.c

mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
the mapped area as well with a certain chance. The patterns are
pseudo-random and the random seed is initialized to the same value so
repeated runs produce the exact same mmap sequence.

I ran the test with a single thread and bound to a single core:

  # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1

[ I ran it as root - so that kernel-space hardware-counter statistics
  are included as well. ]

The results are quite surprisingly candid about the true costs of
paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y):

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |  x86-defconfig |   PARAVIRT=y
 |----------------------------------------------------------------
 |
 |   1311.554526  |  1360.624932   task clock ticks (msecs)  +3.74%
 |
 |             1  |            1   CPU migrations
 |            91  |           79   context switches
 |         55945  |        55943   pagefaults
 | ............................................
 |    3781392474  |   3918777174   CPU cycles                +3.63%
 |    1957153827  |   2161280486   instructions             +10.43%
 |      50234816  |     51303520   cache references          +2.12%
 |       5428258  |      5583728   cache misses              +2.86%
 |
 |   1314.782469  |  1363.694447   time elapsed (msecs)      +3.72%
 |
 -----------------------------------------------

The most surprising element is that in the paravirt_ops case we run 204
million more instructions - out of the ~2000 million instructions total.

That's an increase of over 10%!

That shows the expected $cache risks here as well: i ran this on an
Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
$cache misses quite a bit.

Note that this workload tests a broader range of MM related codepaths -
not just pure pagefault costs.

	Ingo

ps. Measurement methodology:

The software counters show that the test was indeed done on an idle
system: there are no CPU migrations (the task is affine), nor any
significant context-switches, and the pagefault count is essentially
the same as well. (because this is a fully repeatable workload.)

The numbers are a representative sample from a run of more than 10
testruns, on an otherwise idle system.
Measurement noise is very low:

  3906920196  CPU cycles  (events)
  3907556124  CPU cycles  (events)
  3907902335  CPU cycles  (events)
  3914423870  CPU cycles  (events)
  3915642464  CPU cycles  (events)
  3916134988  CPU cycles  (events)
  3916840093  CPU cycles  (events)
  3918777174  CPU cycles  (events)
  3918993251  CPU cycles  (events)
  3919907192  CPU cycles  (events)

The max/min spread of 10 runs is 0.3%, so the precision of this
measurement is in the 0.1% range - more than enough to be conclusive.

The max/min spread of the instruction counts is even better: in the
0.01% range. (that is because exactly the same workload is executed -
only timer IRQs and small disturbances cause noise here.)
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
@ 2009-01-20 14:14   ` Nick Piggin
  2009-01-20 14:17     ` Ingo Molnar
  2009-01-20 15:13   ` Ingo Molnar
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 14:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

On Tue, Jan 20, 2009 at 03:03:24PM +0100, Ingo Molnar wrote:
> [...]
>
> The MM benchmark app can be downloaded from:
>
>   http://redhat.com/~mingo/misc/mmap-perf.c

BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
compiled into a standalone lat_mmap exec by the standard lmbench build).
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:14 ` Nick Piggin
@ 2009-01-20 14:17   ` Ingo Molnar
  2009-01-20 14:41     ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 14:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Nick Piggin <npiggin@suse.de> wrote:

> BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> compiled into a standalone lat_mmap exec by the standard lmbench build).

doesn't that include an indeterminate number of gettimeofday() based
calibration calls? That would make it harder to measure its total costs
in a comparative way.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:17 ` Ingo Molnar
@ 2009-01-20 14:41   ` Nick Piggin
  2009-01-20 15:00     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 14:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> > compiled into a standalone lat_mmap exec by the standard lmbench build).
>
> doesnt that include an indeterminate number of gettimeofday() based
> calibration calls? That would make it harder to measure its total costs
> in a comparative way.

Hmm... yes probably for really detailed profile comparisons or other
external measurements it would need modification.
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:41 ` Nick Piggin
@ 2009-01-20 15:00   ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 15:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
> >
> > doesnt that include an indeterminate number of gettimeofday() based
> > calibration calls? That would make it harder to measure its total
> > costs in a comparative way.
>
> Hmm... yes probably for really detailed profile comparisons or other
> external measurements it would need modification.

yeah.

Btw., it's a trend to be aware of i think: as our commit flux goes up
and the average commit size goes down, it becomes harder and harder to
measure the per commit performance impact.

There are just three ways to handle it: decrease commit flux (which is
out of question), increase commit size (which is out of question as
well), or improve the quality of our measurements.

We can improve performance measurement quality in a number of ways:

 - We can (and should) increase instrumentation precision.
   /usr/bin/time's 10 msec measurement granularity might have been fine
   a decade ago but it is not fine today.

 - We can (and should) increase the number of 'dimensions' (metrics) we
   can instrument the kernel with. Right now we basically only measure
   along the time axis, in 99% of the cases. But 'elapsed time' is a
   tricky, compound and thus noisy unit: it is affected by all delays
   in a workload.

We do profiles occasionally, but they are a lot more difficult to
generate, a lot harder to compare, and hard to plug into regression
analysis.

So if we see a statistically significant shift in one or more metrics
of something like:

 -------------------------------------------------
 |
 | $ ./timec -e -5,-4,-3,0,1,2,3 make -j16 bzImage
 |
 | [...]
 | Kernel: arch/x86/boot/bzImage is ready  (#28)
 |
 | Performance counter stats for 'make':
 |
 |   628315.871980  task clock ticks  (msecs)
 |
 |           42330  CPU migrations    (events)
 |          124980  context switches  (events)
 |        18698292  pagefaults        (events)
 |   1351875946010  CPU cycles        (events)
 |   1121901478363  instructions      (events)
 |     10654788968  cache references  (events)
 |       633581867  cache misses      (events)
 |
 | Wall-clock time elapsed: 118348.109066 msecs
 |
 -------------------------------------------------

it becomes a _lot_ harder to ignore (and talk out of existence) than it
is to ignore a few minor digits changing in:

 ---------------------------------
 |
 | $ time make -j16 bzImage
 |
 | real    0m12.146s
 | user    1m30.050s
 | sys     0m12.757s
 |
 ---------------------------------

( Especially as those minor digits tend to be rather noisy to begin
  with, due to us sampling system/user time from the timer interrupt. )

It becomes even harder to ignore statistically significant regressions
if some of the metrics are hardware-generated hard physical facts - not
something wishy-washy and statistical as stime/utime statistics.

</plug> ;-)

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  2009-01-20 14:14   ` Nick Piggin
@ 2009-01-20 15:13   ` Ingo Molnar
  2009-01-20 19:37   ` Ingo Molnar
  2009-01-20 20:45   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 15:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> That shows the expected $cache risks here as well: i ran this on an
> Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
> $cache misses quite a bit.

[ there's no such thing as "L2 $cache misses" - what i wanted to say is
  that while the instruction cache size is rather small and static, a
  large L2 cache helps in keeping the costs of instruction-cache misses
  low - hence my measurement skews in favor of paravirt. ]

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  2009-01-20 14:14   ` Nick Piggin
  2009-01-20 15:13   ` Ingo Molnar
@ 2009-01-20 19:37   ` Ingo Molnar
  2009-01-20 20:45   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 19:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> |    3781392474  |   3918777174   CPU cycles      +3.63%
> |    1957153827  |   2161280486   instructions   +10.43%
> [...]
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.

So because this test does exactly 1 million MM syscalls, the average is
easy to calculate: the native kernel's average MM syscall cost is 1957
instructions - with CONFIG_PARAVIRT=y that increases by +10.43% to 2161
instructions.

There are over 200 extra instructions executed per MM syscall that we
only do due to CONFIG_PARAVIRT=y.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  ` (2 preceding siblings ...)
  2009-01-20 19:37 ` Ingo Molnar
@ 2009-01-20 20:45 ` Jeremy Fitzhardinge
  2009-01-20 20:56   ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread

From: Jeremy Fitzhardinge @ 2009-01-20 20:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel

Ingo Molnar wrote:
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap
> performance stress-test' app i made years ago.
>
> [...]
>
> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and
> touches the mapped area as well with a certain chance. The patterns
> are pseudo-random and the random seed is initialized to the same value
> so repeated runs produce the exact same mmap sequence.
>
> [...]
>
> |    3781392474  |   3918777174   CPU cycles                +3.63%
> |    1957153827  |   2161280486   instructions             +10.43%
>
!!
> |      50234816  |     51303520   cache references          +2.12%
> |       5428258  |      5583728   cache misses              +2.86%
>
Is this I or D, or combined?

> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.
>
> That's an increase of over 10%!
>
Yow! That's pretty awful. We knew that static instruction count was up,
but wouldn't have thought that it would hit the dynamic instruction
count so much...

I think there are some immediate tweaks we can make to the code
generated for each call site, which will help to an extent.

	J
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 20:45 ` Jeremy Fitzhardinge @ 2009-01-20 20:56 ` Ingo Molnar 2009-01-21 7:27 ` Nick Piggin 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 20:56 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: >> * Ingo Molnar <mingo@elte.hu> wrote: >> >> >>>> Times I believe are in nanoseconds for lmbench, anyway lower is >>>> better. >>>> >>>> non pv AVG=464.22 STD=5.56 >>>> paravirt AVG=502.87 STD=7.36 >>>> >>>> Nearly 10% performance drop here, which is quite a bit... hopefully >>>> people are testing the speed of their PV implementations against >>>> non-PV bare metal :) >>>> >>> Ouch, that looks unacceptably expensive. All the major distros turn >>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the >>> express promise to have no measurable runtime overhead. >>> >> >> Here are some more precise stats done via hw counters on a perfcounters >> kernel using 'timec', running a modified version of the 'mmap >> performance stress-test' app i made years ago. >> >> The MM benchmark app can be downloaded from: >> >> http://redhat.com/~mingo/misc/mmap-perf.c >> >> timec.c can be picked up from: >> >> http://redhat.com/~mingo/perfcounters/timec.c >> >> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and >> touches the mapped area as well with a certain chance. The patterns are >> pseudo-random and the random seed is initialized to the same value so >> repeated runs produce the exact same mmap sequence. >> >> I ran the test with a single thread and bound to a single core: >> >> # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1 >> >> [ I ran it as root - so that kernel-space hardware-counter statistics >> are included as well. 
] >> >> The results are quite surprisingly candid about the true costs of >> paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y): >> >> ----------------------------------------------- >> | Performance counter stats for './mmap-perf' | >> ----------------------------------------------- >> | | >> | x86-defconfig | PARAVIRT=y >> |------------------------------------------------------------------ >> | >> | 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74% >> | | >> | 1 | 1 CPU migrations >> | 91 | 79 context switches >> | 55945 | 55943 pagefaults >> | ............................................ >> | 3781392474 | 3918777174 CPU cycles +3.63% >> | 1957153827 | 2161280486 instructions +10.43% >> > > !! > >> | 50234816 | 51303520 cache references +2.12% >> | 5428258 | 5583728 cache misses +2.86% >> > > Is this I or D, or combined? That's last-level-cache references+misses (L2 cache): Bit Position Event Name UMask Event Select CPUID.AH.EBX 3 LLC Reference 4FH 2EH 4 LLC Misses 41H 2EH Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 20:56 ` Ingo Molnar @ 2009-01-21 7:27 ` Nick Piggin 2009-01-21 22:23 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: Nick Piggin @ 2009-01-21 7:27 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel On Tue, Jan 20, 2009 at 09:56:53PM +0100, Ingo Molnar wrote: > > * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > >> | 50234816 | 51303520 cache references +2.12% > >> | 5428258 | 5583728 cache misses +2.86% > >> > > > > Is this I or D, or combined? > > That's last-level-cache references+misses (L2 cache): > > Bit Position Event Name UMask Event Select > CPUID.AH.EBX > 3 LLC Reference 4FH 2EH > 4 LLC Misses 41H 2EH Oh, _llc_ references/misses? Ouch. You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is coming from? Instruction fetches? It would be interesting to see how "the oltp" benchmark fares with CONFIG_PARAVIRT turned on. That workload lives and dies by the cache :) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-21 7:27 ` Nick Piggin @ 2009-01-21 22:23 ` Jeremy Fitzhardinge 2009-01-22 22:28 ` Zachary Amsden 0 siblings, 1 reply; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-21 22:23 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel Nick Piggin wrote: > Oh, _llc_ references/misses? Ouch. > > You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark > is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is > coming from? Instruction fetches? > I assume so. There should be no extra data accesses with CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic, but I surely hope that's not falling out of cache). J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-21 22:23 ` Jeremy Fitzhardinge @ 2009-01-22 22:28 ` Zachary Amsden 2009-01-22 22:44 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:28 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Wed, 2009-01-21 at 14:23 -0800, Jeremy Fitzhardinge wrote: > Nick Piggin wrote: > > Oh, _llc_ references/misses? Ouch. > > > > You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark > > is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is > > coming from? Instruction fetches? > > > > I assume so. There should be no extra data accesses with > CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic, > but I surely hope that's not falling out of cache). These fragments, from native_pgd_val, certainly don't help: c0120f60: 55 push %ebp c0120f61: 89 e5 mov %esp,%ebp c0120f63: 5d pop %ebp c0120f64: c3 ret c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi That is really disgusting. We absolutely should be patching away the function calls here in the native case.. not sure we do that today. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:28 ` Zachary Amsden @ 2009-01-22 22:44 ` Jeremy Fitzhardinge 2009-01-22 22:49 ` H. Peter Anvin 2009-01-22 22:55 ` Zachary Amsden 0 siblings, 2 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-22 22:44 UTC (permalink / raw) To: Zachary Amsden Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > These fragments, from native_pgd_val, certainly don't help: > > c0120f60: 55 push %ebp > c0120f61: 89 e5 mov %esp,%ebp > c0120f63: 5d pop %ebp > c0120f64: c3 ret > c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi > c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi > Yes, that's a rather awful noop; compiling without frame pointers reduces this to a single "ret". > That is really disgusting. We absolutely should be patching away the > function calls here in the native case.. not sure we do that today. > I did have some patches to do that at one point. If you set pgd_val = paravirt_nop, then the patching machinery will completely nop out the call site. The problem is that it depends on the calling convention using the same regs for the first arg and return - true for 32-bit, but not 64. We could fix that with identity functions which the patcher recognizes and can replace with either pure nops or inline appropriate register moves. Also, I just posted patches to get rid of all pvops calls when fetching or setting flags in a pte, which I hope will help. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:44 ` Jeremy Fitzhardinge @ 2009-01-22 22:49 ` H. Peter Anvin 2009-01-22 22:58 ` Zachary Amsden 2009-01-22 22:55 ` Zachary Amsden 1 sibling, 1 reply; 32+ messages in thread From: H. Peter Anvin @ 2009-01-22 22:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Zachary Amsden, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Jeremy Fitzhardinge wrote: > > I did have some patches to do that at one point. If you set pgd_val = > paravirt_nop, then the patching machinery will completely nop out the > call site. The problem is that it depends on the calling convention > using the same regs for the first arg and return - true for 32-bit, but > not 64. We could fix that with identity functions which the patcher > recognizes and can replace with either pure nops or inline appropriate > register moves. > There is also the option to use assembly wrappers to avoid relying on the calling convention. This is particularly so since we have sites where as little as a two-byte instruction gets bloated up with huge push/pop sequences around a tiny instruction. Those would be better served with a direct call to a stub (5 bytes), which would be repatched to the two-byte instruction + 3 byte nop. -hpa ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:49 ` H. Peter Anvin @ 2009-01-22 22:58 ` Zachary Amsden 2009-01-22 23:52 ` H. Peter Anvin 0 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote: > There is also the option to use assembly wrappers to avoid relying on > the calling convention. This is particularly so since we have sites > where as little as a two-byte instruction gets bloated up with huge > push/pop sequences around a tiny instruction. Those would be better > served with a direct call to a stub (5 bytes), which would be repatched > to the two-byte instruction + 3 byte nop. Yes, for known trivial ops (most!), there isn't any reason to ever have a call to begin with; simply an inline instruction sequence would be fine, and only those callers that override the sequence would need to patch. It's possible to write clever macros to assure there is always space for a 5 byte call. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:58 ` Zachary Amsden @ 2009-01-22 23:52 ` H. Peter Anvin 2009-01-23 0:08 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: H. Peter Anvin @ 2009-01-22 23:52 UTC (permalink / raw) To: Zachary Amsden Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote: > >> There is also the option to use assembly wrappers to avoid relying on >> the calling convention. This is particularly so since we have sites >> where as little as a two-byte instruction gets bloated up with huge >> push/pop sequences around a tiny instruction. Those would be better >> served with a direct call to a stub (5 bytes), which would be repatched >> to the two-byte instruction + 3 byte nop. > > Yes, for known trivial ops (most!), there isn't any reason to ever have > a call to begin with; simply an inline instruction sequence would be > fine, and only those callers that override the sequence would need to > patch. It's possible to write clever macros to assure there is always > space for a 5 byte call. > It's functionally speaking the same thing... the advantage with starting out with the call and then patch in the native code as opposed to the other way around is to be able to handle things properly before we're ready to run the patching code. Right now a number of the call sites contain a huge push/pop sequence followed by an indirect call. We can patch in the native code to avoid the branch overhead, but the register constraints and icache footprint is unchanged. -hpa ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 23:52 ` H. Peter Anvin @ 2009-01-23 0:08 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-23 0:08 UTC (permalink / raw) To: H. Peter Anvin Cc: Zachary Amsden, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel H. Peter Anvin wrote: > Right now a number of the call sites contain a huge push/pop sequence > followed by an indirect call. We can patch in the native code to > avoid the branch overhead, but the register constraints and icache > footprint is unchanged. That's true for the pvops hooks emitted in the .S files, but not so true for ones in C code (well, there are no explicit push/pops, but the presence of the call may cause the compiler to generate them). The .S hooks can definitely be cleaned up, but I don't think that's germane to Nick's observations that the mm code is showing slowdowns. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:44 ` Jeremy Fitzhardinge 2009-01-22 22:49 ` H. Peter Anvin @ 2009-01-22 22:55 ` Zachary Amsden 2009-01-23 0:14 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:55 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Thu, 2009-01-22 at 14:44 -0800, Jeremy Fitzhardinge wrote: > I did have some patches to do that at one point. If you set pgd_val = > paravirt_nop, then the patching machinery will completely nop out the > call site. The problem is that it depends on the calling convention > using the same regs for the first arg and return - true for 32-bit, but > not 64. We could fix that with identity functions which the patcher > recognizes and can replace with either pure nops or inline appropriate > register moves. What about removing the identity functions entirely. They are useless, really. All that is needed is a patch site filled with nops for Xen to overwrite, just stuffing the value into the proper registers. For 64-bit, it can be a simple mov to satisfy the constraints. > Also, I just posted patches to get rid of all pvops calls when fetching > or setting flags in a pte, which I hope will help. Sounds like it will help. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:55 ` Zachary Amsden @ 2009-01-23 0:14 ` Jeremy Fitzhardinge 2009-01-27 7:59 ` Ingo Molnar 0 siblings, 1 reply; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-23 0:14 UTC (permalink / raw) To: Zachary Amsden Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > What about removing the identity functions entirely. They are useless, > really. All that is needed is a patch site filled with nops for Xen to > overwrite, just stuffing the value into the proper registers. For > 64-bit, it can be a simple mov to satisfy the constraints. > I think it comes to the same thing really. Both end up generating a series of nops with values entering and leaving in well-defined registers. The x86-64 calling convention is a bit awkward because the first arg is in rdi and the ret is rax, so it can't quite be pure nops, or we use a non-standard calling-convention with appropriate thunks to call into C code. I think a mov is a better performance-complexity tradeoff. >> Also, I just posted patches to get rid of all pvops calls when fetching >> or setting flags in a pte, which I hope will help. >> > > Sounds like it will help. > ...but apparently not. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-23 0:14 ` Jeremy Fitzhardinge @ 2009-01-27 7:59 ` Ingo Molnar 2009-01-27 8:24 ` Jeremy Fitzhardinge 2009-01-27 10:17 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-27 7:59 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: >>> Also, I just posted patches to get rid of all pvops calls when >>> fetching or setting flags in a pte, which I hope will help. >> >> Sounds like it will help. > > ...but apparently not. ping? This is a very serious paravirt_ops slowdown affecting the native kernel's performance to the tune of 5-10% in certain workloads. It's been about 2 years ago that paravirt_ops went upstream, when you told us that something like this would never happen, that paravirt_ops is designed so flexibly that it will never hinder the native kernel - and if it does it will be easy to fix it. Now is the time to fulfill that promise. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-27 7:59 ` Ingo Molnar @ 2009-01-27 8:24 ` Jeremy Fitzhardinge 2009-01-27 10:17 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-27 8:24 UTC (permalink / raw) To: Ingo Molnar Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Ingo Molnar wrote: > This is a very serious paravirt_ops slowdown affecting the native kernel's > performance to the tune of 5-10% in certain workloads. > > It's been about 2 years ago that paravirt_ops went upstream, when you told > us that something like this would never happen, that paravirt_ops is > designed so flexibly that it will never hinder the native kernel - and if > it does it will be easy to fix it. Now is the time to fulfill that > promise. > Yep, working on it. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-27 7:59 ` Ingo Molnar 2009-01-27 8:24 ` Jeremy Fitzhardinge @ 2009-01-27 10:17 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-27 10:17 UTC (permalink / raw) To: Ingo Molnar Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel [-- Attachment #1: Type: text/plain, Size: 1640 bytes --] Ingo Molnar wrote: > ping? > > This is a very serious paravirt_ops slowdown affecting the native kernel's > performance to the tune of 5-10% in certain workloads. > > It's been about 2 years ago that paravirt_ops went upstream, when you told > us that something like this would never happen, that paravirt_ops is > designed so flexibly that it will never hinder the native kernel - and if > it does it will be easy to fix it. Now is the time to fulfill that > promise. I couldn't exactly reproduce your results, but I guess they're similar in shape. Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (pass 1-5). Interestingly I'm seeing identical instruction counts for pvops vs non-pvops, and a lower cycle count. The cache references are way up and the miss rate is up a bit, which I guess is the source of the slowdown. With the attached patch, I get a clear improvement; it replaces the do-nothing pte_val/make_pte functions with inlined movs to move the argument to return, overpatching the 6-byte indirect call (on i386 it would just be all nopped out). CPU cycles and cache misses are way down, and the tick count is down from ~5% worse to ~2%. But the cache reference rate is even higher, which really doesn't make sense to me. But the patch is a clear improvement, and its hard to see how it could make anything worse (its always going to replace an indirect call with simple inlined code). (Full numbers in spreadsheet.) 
I have a couple of other patches to reduce the register pressure of the pvops calls, but I'm trying to work out how to make sure its not all to complex and/or fragile. J [-- Attachment #2: pvops-mmap-measurements.ods --] [-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 30546 bytes --] [-- Attachment #3: paravirt-ident.patch --] [-- Type: text/plain, Size: 6903 bytes --] Subject: x86/pvops: add a paravirt_indent functions to allow special patching Several paravirt ops implementations simply return their arguments, the most obvious being the make_pte/pte_val class of operations on native. On 32-bit, the identity function is literally a no-op, as the calling convention uses the same registers for the first argument and return. On 64-bit, it can be implemented with a single "mov". This patch adds special identity functions for 32 and 64 bit argument, and machinery to recognize them and replace them with either nops or a mov as appropriate. At the moment, the only users for the identity functions are the pagetable entry conversion functions. The result is a measureable improvement on pagetable-heavy benchmarks (2-3%, reducing the pvops overhead from 5 to 2%). 
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/include/asm/paravirt.h | 5 ++ arch/x86/kernel/paravirt.c | 75 ++++++++++++++++++++++++++++++----- arch/x86/kernel/paravirt_patch_32.c | 12 +++++ arch/x86/kernel/paravirt_patch_64.c | 15 +++++++ 4 files changed, 98 insertions(+), 9 deletions(-) =================================================================== --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -390,6 +390,8 @@ asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":") unsigned paravirt_patch_nop(void); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len); +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len); unsigned paravirt_patch_ignore(unsigned len); unsigned paravirt_patch_call(void *insnbuf, const void *target, u16 tgt_clobbers, @@ -1378,6 +1380,9 @@ } void _paravirt_nop(void); +u32 _paravirt_ident_32(u32); +u64 _paravirt_ident_64(u64); + #define paravirt_nop ((void *)_paravirt_nop) void paravirt_use_bytelocks(void); =================================================================== --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -44,6 +44,17 @@ { } +/* identity function, which can be inlined */ +u32 _paravirt_ident_32(u32 x) +{ + return x; +} + +u64 _paravirt_ident_64(u64 x) +{ + return x; +} + static void __init default_banner(void) { printk(KERN_INFO "Booting paravirtualized kernel on %s\n", @@ -138,9 +149,16 @@ if (opfunc == NULL) /* If there's no function, patch it with a ud2a (BUG) */ ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a)); - else if (opfunc == paravirt_nop) + else if (opfunc == _paravirt_nop) /* If the operation is a nop, then nop the callsite */ ret = paravirt_patch_nop(); + + /* identity functions just return their single argument */ + else if (opfunc == _paravirt_ident_32) + ret = paravirt_patch_ident_32(insnbuf, len); + else if (opfunc == _paravirt_ident_64) + ret = paravirt_patch_ident_64(insnbuf, 
len); + else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) || type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) || type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) || @@ -373,6 +391,45 @@ #endif }; +typedef pte_t make_pte_t(pteval_t); +typedef pmd_t make_pmd_t(pmdval_t); +typedef pud_t make_pud_t(pudval_t); +typedef pgd_t make_pgd_t(pgdval_t); + +typedef pteval_t pte_val_t(pte_t); +typedef pmdval_t pmd_val_t(pmd_t); +typedef pudval_t pud_val_t(pud_t); +typedef pgdval_t pgd_val_t(pgd_t); + + +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) +/* 32-bit pagetable entries */ +#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_32 +#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_32 +#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_32 +#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_32 +#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_32 +#else +/* 64-bit pagetable entries */ +#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_64 +#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_64 +#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_64 +#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_64 +#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_64 +#endif + struct pv_mmu_ops pv_mmu_ops = { #ifndef CONFIG_X86_64 .pagetable_setup_start = native_pagetable_setup_start, @@ -424,21 +481,21 @@ .pmd_clear = native_pmd_clear, #endif .set_pud = native_set_pud, - .pmd_val = native_pmd_val, - .make_pmd = native_make_pmd, + .pmd_val = paravirt_native_pmd_val, 
+ .make_pmd = paravirt_native_make_pmd, #if PAGETABLE_LEVELS == 4 - .pud_val = native_pud_val, - .make_pud = native_make_pud, + .pud_val = paravirt_native_pud_val, + .make_pud = paravirt_native_make_pud, .set_pgd = native_set_pgd, #endif #endif /* PAGETABLE_LEVELS >= 3 */ - .pte_val = native_pte_val, - .pgd_val = native_pgd_val, + .pte_val = paravirt_native_pte_val, + .pgd_val = paravirt_native_pgd_val, - .make_pte = native_make_pte, - .make_pgd = native_make_pgd, + .make_pte = paravirt_native_make_pte, + .make_pgd = paravirt_native_make_pgd, .dup_mmap = paravirt_nop, .exit_mmap = paravirt_nop, =================================================================== --- a/arch/x86/kernel/paravirt_patch_32.c +++ b/arch/x86/kernel/paravirt_patch_32.c @@ -12,6 +12,18 @@ DEF_NATIVE(pv_cpu_ops, clts, "clts"); DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc"); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + /* arg in %eax, return in %eax */ + return 0; +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + /* arg in %edx:%eax, return in %edx:%eax */ + return 0; +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { =================================================================== --- a/arch/x86/kernel/paravirt_patch_64.c +++ b/arch/x86/kernel/paravirt_patch_64.c @@ -19,6 +19,21 @@ DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl"); DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs"); +DEF_NATIVE(, mov32, "mov %edi, %eax"); +DEF_NATIVE(, mov64, "mov %rdi, %rax"); + +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov32, end__mov32); +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov64, end__mov64); +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 11:26 ` Ingo Molnar 2009-01-20 12:34 ` Nick Piggin 2009-01-20 14:03 ` Ingo Molnar @ 2009-01-20 19:05 ` Zachary Amsden 2009-01-20 19:31 ` Ingo Molnar 2009-01-22 22:26 ` Jeremy Fitzhardinge 3 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-20 19:05 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote: > Jeremy, any ideas where this slowdown comes from and how it could be > fixed? Well I'm early responding to this thread before reading on, but I looked at the generated assembly for some common mm paths and it looked awful. The biggest loser was probably having functions to convert pte_t back and forth to pteval_t, which makes most potential mask / shift optimizations impossible - indeed, because the compiler doesn't even understand pte_val(X) = Y is static over the lifetime of the function, it often calls these same conversions back and forth several times, and because this is often done inside hidden macros, it's not even possible to save a cached value in most places. The bulk of state required to keep this extra conversion around ties up a lot of registers and as a result heavily limits potential further optimizations. The code did not look more branchy to me, however, and gcc seemed to do a good job with lining up a nice branch structure in the few paths I looked at. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 19:05 ` Zachary Amsden @ 2009-01-20 19:31 ` Ingo Molnar 0 siblings, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 19:31 UTC (permalink / raw) To: Zachary Amsden Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au * Zachary Amsden <zach@vmware.com> wrote: > On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote: > > > Jeremy, any ideas where this slowdown comes from and how it could be > > fixed? > > Well I'm early responding to this thread before reading on, but I looked > at the generated assembly for some common mm paths and it looked awful. > The biggest loser was probably having functions to convert pte_t back > and forth to pteval_t, which makes most potential mask / shift > optimizations impossible - indeed, because the compiler doesn't even > understand pte_val(X) = Y is static over the lifetime of the function, > it often calls these same conversions back and forth several times, and > because this is often done inside hidden macros, it's not even possible > to save a cached value in most places. > > The bulk of state required to keep this extra conversion around ties up > a lot of registers and as a result heavily limits potential further > optimizations. > > The code did not look more branchy to me, however, and gcc seemed to do > a good job with lining up a nice branch structure in the few paths I > looked at. 
i've extended my mmap test with branch execution hw-perfcounter stats:

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |  x86-defconfig  |     PARAVIRT=y
 |------------------------------------------------------------------
 |
 |    1311.554526  |    1360.624932  task clock ticks (msecs)  +3.74%
 |
 |              1  |              1  CPU migrations
 |             91  |             79  context switches
 |          55945  |          55943  pagefaults
 | ............................................
 |     3781392474  |     3918777174  CPU cycles          +3.63%
 |     1957153827  |     2161280486  instructions       +10.43%
 |       50234816  |       51303520  cache references    +2.12%
 |        5428258  |        5583728  cache misses        +2.86%
 |
 |      437983499  |      478967061  branches            +9.36%
 |       32486067  |       32336874  branch-misses       -0.46%
 |
 |    1314.782469  |    1363.694447  time elapsed (msecs)  +3.72%
 |
 -----------------------------------

So we execute 9.36% more branches - i.e. very noticeably higher as well.
The CPU predicts them slightly more effectively though: the -0.46% for
branch-misses is well above measurement noise (of ~0.02% for the branch
metric), so it's a systematic effect. Non-functional 'boring' bloat
tends to be easier to predict, so it's not necessarily a real surprise.

That also explains why, despite +10.43% more instructions, the total
cycle count went up by a comparatively smaller +3.63%.

[ that's 64-bit x86 btw. ]

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
  ` (2 preceding siblings ...)
  2009-01-20 19:05 ` Zachary Amsden
@ 2009-01-22 22:26 ` Jeremy Fitzhardinge
  2009-01-22 23:04 ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread
From: Jeremy Fitzhardinge @ 2009-01-22 22:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty

Ingo Molnar wrote:
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>   any real $cache overhead. )
>
> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

I just posted a couple of patches to pick some low-hanging fruit. It
turns out that we don't need to do any pvops calls to do pte flag
manipulations. I'd be interested to see how much of a difference it
makes (it reduces the static code size by a few k).

    J
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-22 22:26 ` Jeremy Fitzhardinge
@ 2009-01-22 23:04 ` Ingo Molnar
  2009-01-22 23:30 ` Zachary Amsden
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-01-22 23:04 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>>
>> ( And i suspect the real life mmap cost is probably even more expensive,
>>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>>   any real $cache overhead. )
>>
>> Jeremy, any ideas where this slowdown comes from and how it could be
>> fixed?
>
> I just posted a couple of patches to pick some low-hanging fruit. It
> turns out that we don't need to do any pvops calls to do pte flag
> manipulations. I'd be interested to see how much of a difference it
> makes (it reduces the static code size by a few k).

I've tried your patches - but can see no significant reduction in
overhead. I've updated my table with numbers from your patches:

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |   defconfig   |  PARAVIRT=y  |     +Jeremy
 |-----------------------------------------------------------------------
 |
 |   1311.55452  |  1360.62493  |  1378.94464  task clock (msecs)  +3.74%
 |
 |            1  |           1  |           0  CPU migrations
 |           91  |          79  |          77  context switches
 |        55945  |       55943  |       55980  pagefaults
 |.......................................................................
 |   3781392474  |  3918777174  |  3907189795  CPU cycles          +3.63%
 |   1957153827  |  2161280486  |  2161741689  instructions       +10.43%
 |     50234816  |    51303520  |    50619593  cache references    +2.12%
 |      5428258  |     5583728  |     5575808  cache misses        +2.86%
 |
 |    437983499  |   478967061  |   479053595  branches            +9.36%
 |     32486067  |    32336874  |    32377710  branch-misses       -0.46%
 |
 |   1314.78246  |  1363.69444  |  1357.58161  time elapsed (msecs)  +3.72%
 |
 ------------------------------------------------------------------------

'+Jeremy' is a CONFIG_PARAVIRT=y run done with your patches.

The most stable count is the instruction count:

 |   1957153827  |  2161280486  |  2161741689  instructions       +10.43%

But your two patches did not reduce the instruction count in any
measurable way.

In any case, it is rather inefficient of me proxy-testing your patches;
you can do these measurements yourself too on any Core2 or later Intel
CPU, by running tip/master plus picking up these two utilities:

  http://people.redhat.com/mingo/perfcounters/perfstat.c
  http://redhat.com/~mingo/misc/mmap-perf.c

building them and running this (as root):

  taskset 1 ./perfstat ./mmap-perf 1

it will give you numbers like the ones above.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-22 23:04 ` Ingo Molnar
@ 2009-01-22 23:30 ` Zachary Amsden
  0 siblings, 0 replies; 32+ messages in thread
From: Zachary Amsden @ 2009-01-22 23:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Nick Piggin, Linux Kernel Mailing List,
    Linus Torvalds, hpa@zytor.com, jeremy@xensource.com,
    chrisw@sous-sol.org, rusty@rustcorp.com.au

[-- Attachment #1: Type: text/plain, Size: 650 bytes --]

On Thu, 2009-01-22 at 15:04 -0800, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> In any case, it is rather inefficient of me proxy-testing your patches,
> you can do these measurements yourself too on any Core2 or later Intel
> CPU, by running tip/master plus picking up these two utilities:

Eek, I have no time to spend on this right now, but if anyone is curious
to run this patch (which heavily breaks Xen), I suspect it will cure
most of the performance ailments.

Back when we did the VMI prototyping, we never saw any significant
benchmark reductions until the introduction of the M-to-P conversion
functions.
Zach

[-- Attachment #1.2: paravirt-drop-mpn-ops.patch --]
[-- Type: text/x-patch, Size: 4338 bytes --]

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index e9873a2..e5b39db 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -155,10 +155,6 @@ static inline pteval_t native_pte_flags(pte_t pte)
 #define pgprot_val(x)	((x).pgprot)
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else  /* !CONFIG_PARAVIRT */
-
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)
 
@@ -176,6 +172,8 @@ static inline pteval_t native_pte_flags(pte_t pte)
 #define pte_flags(x)	native_pte_flags(x)
 #define __pte(x)	native_make_pte(x)
 
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
 #endif	/* CONFIG_PARAVIRT */
 
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ba3e2ff..7de3169 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -278,13 +278,6 @@ struct pv_mmu_ops {
 	void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, pte_t pte);
 
-	pteval_t (*pte_val)(pte_t);
-	pteval_t (*pte_flags)(pte_t);
-	pte_t (*make_pte)(pteval_t pte);
-
-	pgdval_t (*pgd_val)(pgd_t);
-	pgd_t (*make_pgd)(pgdval_t pgd);
-
 #if PAGETABLE_LEVELS >= 3
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -298,13 +291,7 @@ struct pv_mmu_ops {
 
 	void (*set_pud)(pud_t *pudp, pud_t pudval);
 
-	pmdval_t (*pmd_val)(pmd_t);
-	pmd_t (*make_pmd)(pmdval_t pmd);
-
 #if PAGETABLE_LEVELS == 4
-	pudval_t (*pud_val)(pud_t);
-	pud_t (*make_pud)(pudval_t pud);
-
 	void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
 #endif	/* PAGETABLE_LEVELS == 4 */
 #endif	/* PAGETABLE_LEVELS >= 3 */
@@ -1054,81 +1041,6 @@ static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
-static inline pte_t __pte(pteval_t val)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t,
-				 pv_mmu_ops.make_pte,
-				 val, (u64)val >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t,
-				 pv_mmu_ops.make_pte,
-				 val);
-
-	return (pte_t) { .pte = ret };
-}
-
-static inline pteval_t pte_val(pte_t pte)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_val,
-				 pte.pte, (u64)pte.pte >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_val,
-				 pte.pte);
-
-	return ret;
-}
-
-static inline pteval_t pte_flags(pte_t pte)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_flags,
-				 pte.pte, (u64)pte.pte >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_flags,
-				 pte.pte);
-
-#ifdef CONFIG_PARAVIRT_DEBUG
-	BUG_ON(ret & PTE_PFN_MASK);
-#endif
-	return ret;
-}
-
-static inline pgd_t __pgd(pgdval_t val)
-{
-	pgdval_t ret;
-
-	if (sizeof(pgdval_t) > sizeof(long))
-		ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.make_pgd,
-				 val, (u64)val >> 32);
-	else
-		ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.make_pgd,
-				 val);
-
-	return (pgd_t) { ret };
-}
-
-static inline pgdval_t pgd_val(pgd_t pgd)
-{
-	pgdval_t ret;
-
-	if (sizeof(pgdval_t) > sizeof(long))
-		ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.pgd_val,
-				 pgd.pgd, (u64)pgd.pgd >> 32);
-	else
-		ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.pgd_val,
-				 pgd.pgd);
-
-	return ret;
-}
-
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
 static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr,
 					   pte_t *ptep)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index e4c8fb6..ac48a2d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -424,23 +424,12 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.pmd_clear = native_pmd_clear,
 #endif
 	.set_pud = native_set_pud,
-	.pmd_val = native_pmd_val,
-	.make_pmd = native_make_pmd,
 
 #if PAGETABLE_LEVELS == 4
-	.pud_val = native_pud_val,
-	.make_pud = native_make_pud,
 	.set_pgd = native_set_pgd,
 #endif
 #endif	/* PAGETABLE_LEVELS >= 3 */
 
-	.pte_val = native_pte_val,
-	.pte_flags = native_pte_flags,
-	.pgd_val = native_pgd_val,
-
-	.make_pte = native_make_pte,
-	.make_pgd = native_make_pgd,
-
 	.dup_mmap = paravirt_nop,
 	.exit_mmap = paravirt_nop,
 	.activate_mm = paravirt_nop,
end of thread, other threads:[~2009-01-27 10:17 UTC | newest]

Thread overview: 32+ messages -- links below jump to the message on this page --
2009-01-20 11:05 lmbench lat_mmap slowdown with CONFIG_PARAVIRT Nick Piggin
2009-01-20 11:26 ` Ingo Molnar
2009-01-20 12:34 ` Nick Piggin
2009-01-20 12:45 ` Ingo Molnar
2009-01-20 13:41 ` Nick Piggin
2009-01-20 14:03 ` Ingo Molnar
2009-01-20 14:14 ` Nick Piggin
2009-01-20 14:17 ` Ingo Molnar
2009-01-20 14:41 ` Nick Piggin
2009-01-20 15:00 ` Ingo Molnar
2009-01-20 15:13 ` Ingo Molnar
2009-01-20 19:37 ` Ingo Molnar
2009-01-20 20:45 ` Jeremy Fitzhardinge
2009-01-20 20:56 ` Ingo Molnar
2009-01-21  7:27 ` Nick Piggin
2009-01-21 22:23 ` Jeremy Fitzhardinge
2009-01-22 22:28 ` Zachary Amsden
2009-01-22 22:44 ` Jeremy Fitzhardinge
2009-01-22 22:49 ` H. Peter Anvin
2009-01-22 22:58 ` Zachary Amsden
2009-01-22 23:52 ` H. Peter Anvin
2009-01-23  0:08 ` Jeremy Fitzhardinge
2009-01-22 22:55 ` Zachary Amsden
2009-01-23  0:14 ` Jeremy Fitzhardinge
2009-01-27  7:59 ` Ingo Molnar
2009-01-27  8:24 ` Jeremy Fitzhardinge
2009-01-27 10:17 ` Jeremy Fitzhardinge
2009-01-20 19:05 ` Zachary Amsden
2009-01-20 19:31 ` Ingo Molnar
2009-01-22 22:26 ` Jeremy Fitzhardinge
2009-01-22 23:04 ` Ingo Molnar
2009-01-22 23:30 ` Zachary Amsden