* lmbench lat_mmap slowdown with CONFIG_PARAVIRT
@ 2009-01-20 11:05 Nick Piggin
  2009-01-20 11:26 ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 11:05 UTC (permalink / raw)
To: Linux Kernel Mailing List, Ingo Molnar, Linus Torvalds, hpa, jeremy,
    chrisw, zach, rusty

Hi,

I'm looking at regressions since 2.6.16, and one is that lat_mmap has
slowed down. On further investigation, a large part of this is not a
_regression_ as such, but the introduction of CONFIG_PARAVIRT=y.

Now, it is true that lat_mmap is basically a microbenchmark, but it
exercises the memory mapping and page fault handler paths, so we're
talking about pretty important paths here. So I think it should be of
interest.

I've run the tests on a 2s8c AMD Barcelona system, binding the test to
CPU0, and running 100 times (stddev is a bit hard to bring down, and my
scripts needed 100 runs in order to pick up much smaller changes in the
results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
problem).

Times I believe are in nanoseconds for lmbench; anyway, lower is better.

  non pv    AVG=464.22  STD=5.56
  paravirt  AVG=502.87  STD=7.36

Nearly 10% performance drop here, which is quite a bit...
hopefully people are testing the speed of their PV implementations
against non-PV bare metal :)

CPU: AMD64 family10, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 10000

BASE:
samples   %        symbol name
49749     32.6336  read_tsc
10151      6.6587  __up_read
 7363      4.8299  unmap_vmas
 7072      4.6390  mnt_drop_write
 4107      2.6941  do_page_fault
 4090      2.6829  rb_get_reader_page
 3601      2.3621  apic_timer_interrupt
 3537      2.3202  set_page_dirty
 3435      2.2532  mnt_want_write
 3302      2.1660  default_idle
 2990      1.9613  idle_cpu
 2904      1.9049  file_update_time
 2789      1.8295  set_page_dirty_balance
 2712      1.7790  retint_swapgs
 2455      1.6104  __do_fault
 2231      1.4635  release_pages
 1989      1.3047  rb_buffer_peek
 1895      1.2431  ring_buffer_consume
 1572      1.0312  handle_mm_fault
 1554      1.0194  put_page
 1461      0.9584  sync_buffer
 1196      0.7845  clear_page_c
 1145      0.7511  rb_advance_reader
 1144      0.7504  hweight64
 1084      0.7111  getnstimeofday
 1076      0.7058  __set_page_dirty_no_writeback
 1020      0.6691  mark_page_accessed
  751      0.4926  tick_do_update_jiffies64

CONFIG_PARAVIRT:
samples   %        symbol name
 8924      7.8849  native_safe_halt
 8823      7.7957  native_read_tsc
 6201      5.4790  default_spin_lock_flags
 5806      5.1300  unmap_vmas
 3996      3.5307  handle_mm_fault
 3954      3.4936  rb_get_reader_page
 3752      3.3151  __do_fault
 2908      2.5694  getnstimeofday
 2303      2.0348  apic_timer_interrupt
 2183      1.9288  find_busiest_group
 2057      1.8175  do_page_fault
 2057      1.8175  hweight64
 2017      1.7821  set_page_dirty
 1926      1.7017  get_next_timer_interrupt
 1781      1.5736  release_pages
 1702      1.5038  native_pte_val
 1620      1.4314  native_sched_clock
 1588      1.4031  rebalance_domains
 1558      1.3766  run_timer_softirq
 1531      1.3527  __down_read_trylock
 1505      1.3298  native_pmd_val
 1445      1.2767  find_get_page
 1369      1.2096  find_next_bit
 1356      1.1981  __ticket_spin_lock
 1225      1.0824  shmem_getpage
 1169      1.0329  radix_tree_lookup_slot
 1166      1.0302  vm_normal_page
  987      0.8721  scheduler_tick
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:05 lmbench lat_mmap slowdown with CONFIG_PARAVIRT Nick Piggin
@ 2009-01-20 11:26 ` Ingo Molnar
  2009-01-20 12:34   ` Nick Piggin
  ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 11:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

* Nick Piggin <npiggin@suse.de> wrote:

> Hi,
>
> I'm looking at regressions since 2.6.16, and one is lat_mmap has slowed
> down. On further investigation, a large part of this is not due to a
> _regression_ as such, but the introduction of CONFIG_PARAVIRT=y.
>
> Now, it is true that lat_mmap is basically a microbenchmark, however it
> is exercising the memory mapping and page fault handler paths, so we're
> talking about pretty important paths here. So I think it should be of
> interest.
>
> I've run the tests on a 2s8c AMD Barcelona system, binding the test to
> CPU0, and running 100 times (stddev is a bit hard to bring down, and my
> scripts needed 100 runs in order to pick up much smaller changes in the
> results -- for CONFIG_PARAVIRT, just a couple of runs should show up the
> problem).
>
> Times I believe are in nanoseconds for lmbench, anyway lower is better.
>
>   non pv    AVG=464.22  STD=5.56
>   paravirt  AVG=502.87  STD=7.36
>
> Nearly 10% performance drop here, which is quite a bit... hopefully
> people are testing the speed of their PV implementations against non-PV
> bare metal :)

Ouch, that looks unacceptably expensive. All the major distros turn
CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
promise to have no measurable runtime overhead.

( And i suspect the real life mmap cost is probably even more expensive,
  as on a Barcelona all of lmbench fits into the cache hence we dont see
  any real $cache overhead. )

Jeremy, any ideas where this slowdown comes from and how it could be
fixed?
	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
@ 2009-01-20 12:34   ` Nick Piggin
  2009-01-20 12:45     ` Ingo Molnar
  2009-01-20 14:03   ` Ingo Molnar
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 12:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > [...]
> >
> >   non pv    AVG=464.22  STD=5.56
> >   paravirt  AVG=502.87  STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against non-PV
> > bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>   any real $cache overhead. )

The PV kernel has over 100K larger text size, nearly 40K alone in mm/
and kernel/. Definitely we don't see the worst of the icache or branch
buffer overhead on this microbenchmark. (wow, that's a nasty amount of
bloat :( )

> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

I had a bit of a poke around the profiles, but nothing stood out.
However oprofile counted 50% more cycles in the kernel with PV than
with non-PV. I'll have to take a look at the user/system times, because
50% seems ludicrous... hopefully it's just oprofile noise.
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 12:34 ` Nick Piggin
@ 2009-01-20 12:45   ` Ingo Molnar
  2009-01-20 13:41     ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 12:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

* Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Jan 20, 2009 at 12:26:34PM +0100, Ingo Molnar wrote:
> > [...]
>
> The PV kernel has over 100K larger text size, nearly 40K alone in mm/
> and kernel/. Definitely we don't see the worst of the icache or branch
> buffer overhead on this microbenchmark. (wow, that's a nasty amount of
> bloat :( )
>
> I had a bit of a poke around the profiles, but nothing stood out.
> However oprofile counted 50% more cycles in the kernel with PV than
> with non-PV. I'll have to take a look at the user/system times, because
> 50% seems ludicrous... hopefully it's just oprofile noise.

If you have a Core2 test-system could you please try tip/master, which
also has your do_page_fault-de-bloating patch applied?

<plug>

The other advantage of tip/master would be that you could try precise
performance counter measurements via:

  http://redhat.com/~mingo/perfcounters/timec.c

and split out the lmbench test-case into a standalone .c file loop.
Running it as:

  $ taskset 0 ./timec -e -5,-4,-3,0,1,2,3 ./mmap-test

Will give you very precise information about what's going on in that
workload:

 Performance counter stats for 'mmap-test':

   628315.871980  task clock ticks  (msecs)

           42330  CPU migrations    (events)
          124980  context switches  (events)
        18698292  pagefaults        (events)
   1351875946010  CPU cycles        (events)
   1121901478363  instructions      (events)
     10654788968  cache references  (events)
       633581867  cache misses      (events)

You might also want to try an NMI profile via kerneltop:

  http://redhat.com/~mingo/perfcounters/kerneltop.c

just run it with no arguments on a perfcounters kernel and it will give
you something like:

------------------------------------------------------------------------------
 KernelTop: 20297 irqs/sec [NMI, 10000 cache-misses], (all, 8 CPUs)
------------------------------------------------------------------------------

   events             RIP        kernel function
  _______   _________________    ________________

 12816.00 - ffffffff803d5760 : copy_user_generic_string!
 11751.00 - ffffffff80647a2c : unix_stream_recvmsg
 10215.00 - ffffffff805eda5f : sock_alloc_send_skb
  9738.00 - ffffffff80284821 : flush_free_list
  6749.00 - ffffffff802854a1 : __kmalloc_track_caller
  3663.00 - ffffffff805f09fa : skb_dequeue
  3591.00 - ffffffff80284be2 : kmem_cache_alloc [qla2xxx]
  3501.00 - ffffffff805f15f5 : __alloc_skb
  1296.00 - ffffffff803d8eb4 : list_del [qla2xxx]
  1110.00 - ffffffff805f0ed2 : kfree_skb

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 12:45 ` Ingo Molnar
@ 2009-01-20 13:41   ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 13:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty

On Tue, Jan 20, 2009 at 01:45:00PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
> > [...]
> >
> > I had a bit of a poke around the profiles, but nothing stood out.
> > However oprofile counted 50% more cycles in the kernel with PV than
> > with non-PV. I'll have to take a look at the user/system times,
> > because 50% seems ludicrous... hopefully it's just oprofile noise.

kbuild costs go up a bit (average of 30 builds):

  elapsed  non-pv: AVG=53.31s   STD=0.99
           pv:     AVG=53.54s   STD=0.94
  user     non-pv: AVG=318.63s  STD=0.19
           pv:     AVG=319.33s  STD=0.23
  system   non-pv: AVG=30.56s   STD=0.15
           pv:     AVG=31.80s   STD=0.15

The kernel side of the kbuild workload slows down by 4.1%. User time
also increases a bit (probably more cache and branch misses).

> If you have a Core2 test-system could you please try tip/master, which
> also has your do_page_fault-de-bloating patch applied?

Will try to get one to do some runs on.

Thanks,
Nick
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
  2009-01-20 12:34   ` Nick Piggin
@ 2009-01-20 14:03   ` Ingo Molnar
  2009-01-20 14:14     ` Nick Piggin
  ` (3 more replies)
  2009-01-20 19:05   ` Zachary Amsden
  2009-01-22 22:26   ` Jeremy Fitzhardinge
  3 siblings, 4 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 14:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> > Times I believe are in nanoseconds for lmbench, anyway lower is
> > better.
> >
> >   non pv    AVG=464.22  STD=5.56
> >   paravirt  AVG=502.87  STD=7.36
> >
> > Nearly 10% performance drop here, which is quite a bit... hopefully
> > people are testing the speed of their PV implementations against
> > non-PV bare metal :)
>
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.

Here are some more precise stats done via hw counters on a perfcounters
kernel using 'timec', running a modified version of the 'mmap
performance stress-test' app i made years ago.

The MM benchmark app can be downloaded from:

  http://redhat.com/~mingo/misc/mmap-perf.c

timec.c can be picked up from:

  http://redhat.com/~mingo/perfcounters/timec.c

mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
the mapped area as well with a certain chance. The patterns are
pseudo-random and the random seed is initialized to the same value so
repeated runs produce the exact same mmap sequence.

I ran the test with a single thread and bound to a single core:

  # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1

[ I ran it as root - so that kernel-space hardware-counter statistics
  are included as well. ]

The results are quite surprisingly candid about the true costs of
paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y):

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |  x86-defconfig |   PARAVIRT=y
 |----------------------------------------------------------------
 |
 |   1311.554526  |  1360.624932   task clock ticks (msecs)  +3.74%
 |
 |             1  |            1   CPU migrations
 |            91  |           79   context switches
 |         55945  |        55943   pagefaults
 | ............................................
 |    3781392474  |   3918777174   CPU cycles                +3.63%
 |    1957153827  |   2161280486   instructions             +10.43%
 |      50234816  |     51303520   cache references          +2.12%
 |       5428258  |      5583728   cache misses              +2.86%
 |
 |   1314.782469  |  1363.694447   time elapsed (msecs)      +3.72%
 |
 -----------------------------------------------

The most surprising element is that in the paravirt_ops case we run 204
million more instructions - out of the ~2000 million instructions total.

That's an increase of over 10%!

That shows the expected $cache risks here as well: i ran this on an
Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
$cache misses quite a bit.

Note that this workload tests a broader range of MM related codepaths -
not just pure pagefault costs.

	Ingo

ps. Measurement methodology:

The software counters show that the test was indeed done on an idle
system: there are no CPU migrations (the task is affine), nor any
significant context-switches, and the pagefault count is essentially
the same as well. (because this is a fully repeatable workload.)

The numbers are a representative sample from a run of more than 10
testruns, on an otherwise idle system.
Measurement noise is very low:

  3906920196  CPU cycles  (events)
  3907556124  CPU cycles  (events)
  3907902335  CPU cycles  (events)
  3914423870  CPU cycles  (events)
  3915642464  CPU cycles  (events)
  3916134988  CPU cycles  (events)
  3916840093  CPU cycles  (events)
  3918777174  CPU cycles  (events)
  3918993251  CPU cycles  (events)
  3919907192  CPU cycles  (events)

The max/min spread of 10 runs is 0.3%, so the precision of this
measurement is in the 0.1% range - more than enough to be conclusive.

The max/min spread of the instruction counts is even better: in the
0.01% range. (that is because exactly the same workload is executed -
only timer IRQs and small disturbances cause noise here.)
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
@ 2009-01-20 14:14   ` Nick Piggin
  2009-01-20 14:17     ` Ingo Molnar
  2009-01-20 15:13   ` Ingo Molnar
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 14:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

On Tue, Jan 20, 2009 at 03:03:24PM +0100, Ingo Molnar wrote:
> [...]
>
> The MM benchmark app can be downloaded from:
>
>   http://redhat.com/~mingo/misc/mmap-perf.c

BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
compiled into a standalone lat_mmap exec by the standard lmbench build).
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:14 ` Nick Piggin
@ 2009-01-20 14:17   ` Ingo Molnar
  2009-01-20 14:41     ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 14:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Nick Piggin <npiggin@suse.de> wrote:

> BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> compiled into a standalone lat_mmap exec by the standard lmbench build).

doesn't that include an indeterminate number of gettimeofday() based
calibration calls? That would make it harder to measure its total costs
in a comparative way.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:17 ` Ingo Molnar
@ 2009-01-20 14:41   ` Nick Piggin
  2009-01-20 15:00     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread

From: Nick Piggin @ 2009-01-20 14:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > BTW. the lmbench test I run directly (it's called lat_mmap.c, and gets
> > compiled into a standalone lat_mmap exec by the standard lmbench build).
>
> doesnt that include an indeterminate number of gettimeofday() based
> calibration calls? That would make it harder to measure its total costs
> in a comparative way.

Hmm... yes probably for really detailed profile comparisons or other
external measurements it would need modification.
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:41 ` Nick Piggin
@ 2009-01-20 15:00   ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 15:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Jan 20, 2009 at 03:17:35PM +0100, Ingo Molnar wrote:
> >
> > doesnt that include an indeterminate number of gettimeofday() based
> > calibration calls? That would make it harder to measure its total
> > costs in a comparative way.
>
> Hmm... yes probably for really detailed profile comparisons or other
> external measurements it would need modification.

yeah.

Btw., it's a trend to be aware of i think: as our commit flux goes up
and the average commit size goes down, it becomes harder and harder to
measure the per commit performance impact.

There are just three ways to handle it: decrease commit flux (which is
out of question), increase commit size (which is out of question as
well), or improve the quality of our measurements.

We can improve performance measurement quality in a number of ways:

 - We can (and should) increase instrumentation precision.
   /usr/bin/time's 10 msec measurement granularity might have been fine
   a decade ago but it is not fine today.

 - We can (and should) increase the number of 'dimensions' (metrics) we
   can instrument the kernel with. Right now we basically only measure
   along the time axis, in 99% of the cases. But 'elapsed time' is a
   tricky, compound and thus noisy unit: it is affected by all delays
   in a workload.

We do profiles occasionally, but they are a lot more difficult to
generate, a lot harder to compare, and hard to plug into regression
analysis.

So if we see a statistically significant shift in one or more metrics
of something like:

 -------------------------------------------------
 |
 | $ ./timec -e -5,-4,-3,0,1,2,3 make -j16 bzImage
 |
 | [...]
 | Kernel: arch/x86/boot/bzImage is ready  (#28)
 |
 | Performance counter stats for 'make':
 |
 |   628315.871980  task clock ticks  (msecs)
 |
 |           42330  CPU migrations    (events)
 |          124980  context switches  (events)
 |        18698292  pagefaults        (events)
 |   1351875946010  CPU cycles        (events)
 |   1121901478363  instructions      (events)
 |     10654788968  cache references  (events)
 |       633581867  cache misses      (events)
 |
 | Wall-clock time elapsed: 118348.109066 msecs
 |
 -------------------------------------------------

it becomes a _lot_ harder to ignore (and talk out of existence) than it
is to ignore a few minor digits changing in:

 ---------------------------------
 |
 | $ time make -j16 bzImage
 |
 | real    0m12.146s
 | user    1m30.050s
 | sys     0m12.757s
 |
 ---------------------------------

( Especially as those minor digits tend to be rather noisy to begin
  with, due to us sampling system/user time from the timer interrupt. )

It becomes even harder to ignore statistically significant regressions
if some of the metrics are hardware-generated hard physical facts - not
something wishy-washy and statistical as stime/utime statistics.

</plug> ;-)

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  2009-01-20 14:14   ` Nick Piggin
@ 2009-01-20 15:13   ` Ingo Molnar
  2009-01-20 19:37   ` Ingo Molnar
  2009-01-20 20:45   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 15:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> That shows the expected $cache risks here as well: i ran this on an
> Extreme Edition CPU which has a ton of L2 cache [4MB] which mutes L2
> $cache misses quite a bit.

[ there's no such thing as "L2 $cache misses" - what i wanted to say is
  that while the instruction cache size is rather small and static, a
  large L2 cache helps in keeping the costs of instruction-cache misses
  low - hence my measurement skews in favor of paravirt. ]

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  2009-01-20 14:14   ` Nick Piggin
  2009-01-20 15:13   ` Ingo Molnar
@ 2009-01-20 19:37   ` Ingo Molnar
  2009-01-20 20:45   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 32+ messages in thread

From: Ingo Molnar @ 2009-01-20 19:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw,
    zach, rusty, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> |    3781392474  |   3918777174   CPU cycles      +3.63%
> |    1957153827  |   2161280486   instructions   +10.43%
> [...]
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.

So because this test does exactly 1 million MM syscalls, the average is
easy to calculate: the native kernel's average MM syscall cost is 1957
instructions - with CONFIG_PARAVIRT=y that increases by +10.43% to 2161
instructions.

There are over 200 extra instructions executed per MM syscall that we
only do due to CONFIG_PARAVIRT=y.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 14:03 ` Ingo Molnar
  ` (2 preceding siblings ...)
  2009-01-20 19:37 ` Ingo Molnar
@ 2009-01-20 20:45 ` Jeremy Fitzhardinge
  2009-01-20 20:56   ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread

From: Jeremy Fitzhardinge @ 2009-01-20 20:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel

Ingo Molnar wrote:
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap
> performance stress-test' app i made years ago.
>
> [...]
>
> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and
> touches the mapped area as well with a certain chance. The patterns
> are pseudo-random and the random seed is initialized to the same value
> so repeated runs produce the exact same mmap sequence.
>
> [...]
>
> |    3781392474  |   3918777174   CPU cycles                +3.63%
> |    1957153827  |   2161280486   instructions             +10.43%
>
!!
> |      50234816  |     51303520   cache references          +2.12%
> |       5428258  |      5583728   cache misses              +2.86%
>
Is this I or D, or combined?

> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.
>
> That's an increase of over 10%!
>
Yow! That's pretty awful. We knew that static instruction count was up,
but wouldn't have thought that it would hit the dynamic instruction
count so much...

I think there are some immediate tweaks we can make to the code
generated for each call site, which will help to an extent.

	J
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 20:45 ` Jeremy Fitzhardinge @ 2009-01-20 20:56 ` Ingo Molnar 2009-01-21 7:27 ` Nick Piggin 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 20:56 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ingo Molnar wrote: >> * Ingo Molnar <mingo@elte.hu> wrote: >> >> >>>> Times I believe are in nanoseconds for lmbench, anyway lower is >>>> better. >>>> >>>> non pv AVG=464.22 STD=5.56 >>>> paravirt AVG=502.87 STD=7.36 >>>> >>>> Nearly 10% performance drop here, which is quite a bit... hopefully >>>> people are testing the speed of their PV implementations against >>>> non-PV bare metal :) >>>> >>> Ouch, that looks unacceptably expensive. All the major distros turn >>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the >>> express promise to have no measurable runtime overhead. >>> >> >> Here are some more precise stats done via hw counters on a perfcounters >> kernel using 'timec', running a modified version of the 'mmap >> performance stress-test' app i made years ago. >> >> The MM benchmark app can be downloaded from: >> >> http://redhat.com/~mingo/misc/mmap-perf.c >> >> timec.c can be picked up from: >> >> http://redhat.com/~mingo/perfcounters/timec.c >> >> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and >> touches the mapped area as well with a certain chance. The patterns are >> pseudo-random and the random seed is initialized to the same value so >> repeated runs produce the exact same mmap sequence. >> >> I ran the test with a single thread and bound to a single core: >> >> # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1 >> >> [ I ran it as root - so that kernel-space hardware-counter statistics >> are included as well. 
] >> >> The results are quite surprisingly candid about the true costs of >> paravirt_ops on the native kernel's overhead (CONFIG_PARAVIRT=y): >> >> ----------------------------------------------- >> | Performance counter stats for './mmap-perf' | >> ----------------------------------------------- >> | | >> | x86-defconfig | PARAVIRT=y >> |------------------------------------------------------------------ >> | >> | 1311.554526 | 1360.624932 task clock ticks (msecs) +3.74% >> | | >> | 1 | 1 CPU migrations >> | 91 | 79 context switches >> | 55945 | 55943 pagefaults >> | ............................................ >> | 3781392474 | 3918777174 CPU cycles +3.63% >> | 1957153827 | 2161280486 instructions +10.43% >> > > !! > >> | 50234816 | 51303520 cache references +2.12% >> | 5428258 | 5583728 cache misses +2.86% >> > > Is this I or D, or combined? That's last-level-cache references+misses (L2 cache): Bit Position Event Name UMask Event Select CPUID.AH.EBX 3 LLC Reference 4FH 2EH 4 LLC Misses 41H 2EH Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 20:56 ` Ingo Molnar @ 2009-01-21 7:27 ` Nick Piggin 2009-01-21 22:23 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: Nick Piggin @ 2009-01-21 7:27 UTC (permalink / raw) To: Ingo Molnar Cc: Jeremy Fitzhardinge, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel On Tue, Jan 20, 2009 at 09:56:53PM +0100, Ingo Molnar wrote: > > * Jeremy Fitzhardinge <jeremy@goop.org> wrote: > >> | 50234816 | 51303520 cache references +2.12% > >> | 5428258 | 5583728 cache misses +2.86% > >> > > > > Is this I or D, or combined? > > That's last-level-cache references+misses (L2 cache): > > Bit Position Event Name UMask Event Select > CPUID.AH.EBX > 3 LLC Reference 4FH 2EH > 4 LLC Misses 41H 2EH Oh, _llc_ references/misses? Ouch. You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is coming from? Instruction fetches? It would be interesting to see how "the oltp" benchmark fares with CONFIG_PARAVIRT turned on. That workload lives and dies by the cache :) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-21 7:27 ` Nick Piggin @ 2009-01-21 22:23 ` Jeremy Fitzhardinge 2009-01-22 22:28 ` Zachary Amsden 0 siblings, 1 reply; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-21 22:23 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa, jeremy, chrisw, zach, rusty, Andrew Morton, Xen-devel Nick Piggin wrote: > Oh, _llc_ references/misses? Ouch. > > You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark > is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is > coming from? Instruction fetches? > I assume so. There should be no extra data accesses with CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic, but I surely hope that's not falling out of cache). J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-21 22:23 ` Jeremy Fitzhardinge @ 2009-01-22 22:28 ` Zachary Amsden 2009-01-22 22:44 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:28 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Wed, 2009-01-21 at 14:23 -0800, Jeremy Fitzhardinge wrote: > Nick Piggin wrote: > > Oh, _llc_ references/misses? Ouch. > > > > You have, what 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark > > is seeing increased L2 misses by nearly 3%. Hmm, I wonder where that is > > coming from? Instruction fetches? > > > > I assume so. There should be no extra data accesses with > CONFIG_PARAVIRT (hm, there's probably some extra stack/spill traffic, > but I surely hope that's not falling out of cache). These fragments, from native_pgd_val, certainly don't help: c0120f60: 55 push %ebp c0120f61: 89 e5 mov %esp,%ebp c0120f63: 5d pop %ebp c0120f64: c3 ret c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi That is really disgusting. We absolutely should be patching away the function calls here in the native case.. not sure we do that today. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:28 ` Zachary Amsden @ 2009-01-22 22:44 ` Jeremy Fitzhardinge 2009-01-22 22:49 ` H. Peter Anvin 2009-01-22 22:55 ` Zachary Amsden 0 siblings, 2 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-22 22:44 UTC (permalink / raw) To: Zachary Amsden Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > These fragments, from native_pgd_val, certainly don't help: > > c0120f60: 55 push %ebp > c0120f61: 89 e5 mov %esp,%ebp > c0120f63: 5d pop %ebp > c0120f64: c3 ret > c0120f65: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi > c0120f69: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi > Yes, that's a rather awful noop; compiling without frame pointers reduces this to a single "ret". > That is really disgusting. We absolutely should be patching away the > function calls here in the native case.. not sure we do that today. > I did have some patches to do that at one point. If you set pgd_val = paravirt_nop, then the patching machinery will completely nop out the call site. The problem is that it depends on the calling convention using the same regs for the first arg and return - true for 32-bit, but not 64. We could fix that with identity functions which the patcher recognizes and can replace with either pure nops or inline appropriate register moves. Also, I just posted patches to get rid of all pvops calls when fetching or setting flags in a pte, which I hope will help. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:44 ` Jeremy Fitzhardinge @ 2009-01-22 22:49 ` H. Peter Anvin 2009-01-22 22:58 ` Zachary Amsden 2009-01-22 22:55 ` Zachary Amsden 1 sibling, 1 reply; 32+ messages in thread From: H. Peter Anvin @ 2009-01-22 22:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Zachary Amsden, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Jeremy Fitzhardinge wrote: > > I did have some patches to do that at one point. If you set pgd_val = > paravirt_nop, then the patching machinery will completely nop out the > call site. The problem is that it depends on the calling convention > using the same regs for the first arg and return - true for 32-bit, but > not 64. We could fix that with identity functions which the patcher > recognizes and can replace with either pure nops or inline appropriate > register moves. > There is also the option to use assembly wrappers to avoid relying on the calling convention. This is particularly so since we have sites where as little as a two-byte instruction gets bloated up with huge push/pop sequences around a tiny instruction. Those would be better served with a direct call to a stub (5 bytes), which would be repatched to the two-byte instruction + 3 byte nop. -hpa ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:49 ` H. Peter Anvin @ 2009-01-22 22:58 ` Zachary Amsden 2009-01-22 23:52 ` H. Peter Anvin 0 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote: > There is also the option to use assembly wrappers to avoid relying on > the calling convention. This is particularly so since we have sites > where as little as a two-byte instruction gets bloated up with huge > push/pop sequences around a tiny instruction. Those would be better > served with a direct call to a stub (5 bytes), which would be repatched > to the two-byte instruction + 3 byte nop. Yes, for known trivial ops (most!), there isn't any reason to ever have a call to begin with; simply an inline instruction sequence would be fine, and only those callers that override the sequence would need to patch. It's possible to write clever macros to assure there is always space for a 5 byte call. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:58 ` Zachary Amsden @ 2009-01-22 23:52 ` H. Peter Anvin 2009-01-23 0:08 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 32+ messages in thread From: H. Peter Anvin @ 2009-01-22 23:52 UTC (permalink / raw) To: Zachary Amsden Cc: Jeremy Fitzhardinge, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > On Thu, 2009-01-22 at 14:49 -0800, H. Peter Anvin wrote: > >> There is also the option to use assembly wrappers to avoid relying on >> the calling convention. This is particularly so since we have sites >> where as little as a two-byte instruction gets bloated up with huge >> push/pop sequences around a tiny instruction. Those would be better >> served with a direct call to a stub (5 bytes), which would be repatched >> to the two-byte instruction + 3 byte nop. > > Yes, for known trivial ops (most!), there isn't any reason to ever have > a call to begin with; simply an inline instruction sequence would be > fine, and only those callers that override the sequence would need to > patch. It's possible to write clever macros to assure there is always > space for a 5 byte call. > It's functionally speaking the same thing... the advantage with starting out with the call and then patch in the native code as opposed to the other way around is to be able to handle things properly before we're ready to run the patching code. Right now a number of the call sites contain a huge push/pop sequence followed by an indirect call. We can patch in the native code to avoid the branch overhead, but the register constraints and icache footprint is unchanged. -hpa ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 23:52 ` H. Peter Anvin @ 2009-01-23 0:08 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-23 0:08 UTC (permalink / raw) To: H. Peter Anvin Cc: Zachary Amsden, Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel H. Peter Anvin wrote: > Right now a number of the call sites contain a huge push/pop sequence > followed by an indirect call. We can patch in the native code to > avoid the branch overhead, but the register constraints and icache > footprint is unchanged. That's true for the pvops hooks emitted in the .S files, but not so true for ones in C code (well, there are no explicit push/pops, but the presence of the call may cause the compiler to generate them). The .S hooks can definitely be cleaned up, but I don't think that's germane to Nick's observations that the mm code is showing slowdowns. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:44 ` Jeremy Fitzhardinge 2009-01-22 22:49 ` H. Peter Anvin @ 2009-01-22 22:55 ` Zachary Amsden 2009-01-23 0:14 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-22 22:55 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel On Thu, 2009-01-22 at 14:44 -0800, Jeremy Fitzhardinge wrote: > I did have some patches to do that at one point. If you set pgd_val = > paravirt_nop, then the patching machinery will completely nop out the > call site. The problem is that it depends on the calling convention > using the same regs for the first arg and return - true for 32-bit, but > not 64. We could fix that with identity functions which the patcher > recognizes and can replace with either pure nops or inline appropriate > register moves. What about removing the identity functions entirely. They are useless, really. All that is needed is a patch site filled with nops for Xen to overwrite, just stuffing the value into the proper registers. For 64-bit, it can be a simple mov to satisfy the constraints. > Also, I just posted patches to get rid of all pvops calls when fetching > or setting flags in a pte, which I hope will help. Sounds like it will help. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-22 22:55 ` Zachary Amsden @ 2009-01-23 0:14 ` Jeremy Fitzhardinge 2009-01-27 7:59 ` Ingo Molnar 0 siblings, 1 reply; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-23 0:14 UTC (permalink / raw) To: Zachary Amsden Cc: Nick Piggin, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Zachary Amsden wrote: > What about removing the identity functions entirely. They are useless, > really. All that is needed is a patch site filled with nops for Xen to > overwrite, just stuffing the value into the proper registers. For > 64-bit, it can be a simple mov to satisfy the constraints. > I think it comes to the same thing really. Both end up generating a series of nops with values entering and leaving in well-defined registers. The x86-64 calling convention is a bit awkward because the first arg is in rdi and the ret is rax, so it can't quite be pure nops, or we use a non-standard calling-convention with appropriate thunks to call into C code. I think a mov is a better performance-complexity tradeoff. >> Also, I just posted patches to get rid of all pvops calls when fetching >> or setting flags in a pte, which I hope will help. >> > > Sounds like it will help. > ...but apparently not. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-23 0:14 ` Jeremy Fitzhardinge @ 2009-01-27 7:59 ` Ingo Molnar 2009-01-27 8:24 ` Jeremy Fitzhardinge 2009-01-27 10:17 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-27 7:59 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel * Jeremy Fitzhardinge <jeremy@goop.org> wrote: >>> Also, I just posted patches to get rid of all pvops calls when >>> fetching or setting flags in a pte, which I hope will help. >> >> Sounds like it will help. > > ...but apparently not. ping? This is a very serious paravirt_ops slowdown affecting the native kernel's performance to the tune of 5-10% in certain workloads. It's been about 2 years ago that paravirt_ops went upstream, when you told us that something like this would never happen, that paravirt_ops is designed so flexibly that it will never hinder the native kernel - and if it does it will be easy to fix it. Now is the time to fulfill that promise. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-27 7:59 ` Ingo Molnar @ 2009-01-27 8:24 ` Jeremy Fitzhardinge 2009-01-27 10:17 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-27 8:24 UTC (permalink / raw) To: Ingo Molnar Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel Ingo Molnar wrote: > This is a very serious paravirt_ops slowdown affecting the native kernel's > performance to the tune of 5-10% in certain workloads. > > It's been about 2 years ago that paravirt_ops went upstream, when you told > us that something like this would never happen, that paravirt_ops is > designed so flexibly that it will never hinder the native kernel - and if > it does it will be easy to fix it. Now is the time to fulfill that > promise. > Yep, working on it. J ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-27 7:59 ` Ingo Molnar 2009-01-27 8:24 ` Jeremy Fitzhardinge @ 2009-01-27 10:17 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 32+ messages in thread From: Jeremy Fitzhardinge @ 2009-01-27 10:17 UTC (permalink / raw) To: Ingo Molnar Cc: Zachary Amsden, Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au, Andrew Morton, Xen-devel [-- Attachment #1: Type: text/plain, Size: 1640 bytes --] Ingo Molnar wrote: > ping? > > This is a very serious paravirt_ops slowdown affecting the native kernel's > performance to the tune of 5-10% in certain workloads. > > It's been about 2 years ago that paravirt_ops went upstream, when you told > us that something like this would never happen, that paravirt_ops is > designed so flexibly that it will never hinder the native kernel - and if > it does it will be easy to fix it. Now is the time to fulfill that > promise. I couldn't exactly reproduce your results, but I guess they're similar in shape. Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (pass 1-5). Interestingly I'm seeing identical instruction counts for pvops vs non-pvops, and a lower cycle count. The cache references are way up and the miss rate is up a bit, which I guess is the source of the slowdown. With the attached patch, I get a clear improvement; it replaces the do-nothing pte_val/make_pte functions with inlined movs to move the argument to return, overpatching the 6-byte indirect call (on i386 it would just be all nopped out). CPU cycles and cache misses are way down, and the tick count is down from ~5% worse to ~2%. But the cache reference rate is even higher, which really doesn't make sense to me. But the patch is a clear improvement, and its hard to see how it could make anything worse (its always going to replace an indirect call with simple inlined code). (Full numbers in spreadsheet.) 
I have a couple of other patches to reduce the register pressure of the pvops calls, but I'm trying to work out how to make sure its not all to complex and/or fragile. J [-- Attachment #2: pvops-mmap-measurements.ods --] [-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 30546 bytes --] [-- Attachment #3: paravirt-ident.patch --] [-- Type: text/plain, Size: 6903 bytes --] Subject: x86/pvops: add a paravirt_indent functions to allow special patching Several paravirt ops implementations simply return their arguments, the most obvious being the make_pte/pte_val class of operations on native. On 32-bit, the identity function is literally a no-op, as the calling convention uses the same registers for the first argument and return. On 64-bit, it can be implemented with a single "mov". This patch adds special identity functions for 32 and 64 bit argument, and machinery to recognize them and replace them with either nops or a mov as appropriate. At the moment, the only users for the identity functions are the pagetable entry conversion functions. The result is a measureable improvement on pagetable-heavy benchmarks (2-3%, reducing the pvops overhead from 5 to 2%). 
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/include/asm/paravirt.h | 5 ++ arch/x86/kernel/paravirt.c | 75 ++++++++++++++++++++++++++++++----- arch/x86/kernel/paravirt_patch_32.c | 12 +++++ arch/x86/kernel/paravirt_patch_64.c | 15 +++++++ 4 files changed, 98 insertions(+), 9 deletions(-) =================================================================== --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -390,6 +390,8 @@ asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":") unsigned paravirt_patch_nop(void); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len); +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len); unsigned paravirt_patch_ignore(unsigned len); unsigned paravirt_patch_call(void *insnbuf, const void *target, u16 tgt_clobbers, @@ -1378,6 +1380,9 @@ } void _paravirt_nop(void); +u32 _paravirt_ident_32(u32); +u64 _paravirt_ident_64(u64); + #define paravirt_nop ((void *)_paravirt_nop) void paravirt_use_bytelocks(void); =================================================================== --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -44,6 +44,17 @@ { } +/* identity function, which can be inlined */ +u32 _paravirt_ident_32(u32 x) +{ + return x; +} + +u64 _paravirt_ident_64(u64 x) +{ + return x; +} + static void __init default_banner(void) { printk(KERN_INFO "Booting paravirtualized kernel on %s\n", @@ -138,9 +149,16 @@ if (opfunc == NULL) /* If there's no function, patch it with a ud2a (BUG) */ ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a)); - else if (opfunc == paravirt_nop) + else if (opfunc == _paravirt_nop) /* If the operation is a nop, then nop the callsite */ ret = paravirt_patch_nop(); + + /* identity functions just return their single argument */ + else if (opfunc == _paravirt_ident_32) + ret = paravirt_patch_ident_32(insnbuf, len); + else if (opfunc == _paravirt_ident_64) + ret = paravirt_patch_ident_64(insnbuf, 
len); + else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) || type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) || type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) || @@ -373,6 +391,45 @@ #endif }; +typedef pte_t make_pte_t(pteval_t); +typedef pmd_t make_pmd_t(pmdval_t); +typedef pud_t make_pud_t(pudval_t); +typedef pgd_t make_pgd_t(pgdval_t); + +typedef pteval_t pte_val_t(pte_t); +typedef pmdval_t pmd_val_t(pmd_t); +typedef pudval_t pud_val_t(pud_t); +typedef pgdval_t pgd_val_t(pgd_t); + + +#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE) +/* 32-bit pagetable entries */ +#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_32 +#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_32 +#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_32 +#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_32 + +#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_32 +#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_32 +#else +/* 64-bit pagetable entries */ +#define paravirt_native_make_pte (make_pte_t *)_paravirt_ident_64 +#define paravirt_native_pte_val (pte_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pmd (make_pmd_t *)_paravirt_ident_64 +#define paravirt_native_pmd_val (pmd_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pud (make_pud_t *)_paravirt_ident_64 +#define paravirt_native_pud_val (pud_val_t *)_paravirt_ident_64 + +#define paravirt_native_make_pgd (make_pgd_t *)_paravirt_ident_64 +#define paravirt_native_pgd_val (pgd_val_t *)_paravirt_ident_64 +#endif + struct pv_mmu_ops pv_mmu_ops = { #ifndef CONFIG_X86_64 .pagetable_setup_start = native_pagetable_setup_start, @@ -424,21 +481,21 @@ .pmd_clear = native_pmd_clear, #endif .set_pud = native_set_pud, - .pmd_val = native_pmd_val, - .make_pmd = native_make_pmd, + .pmd_val = paravirt_native_pmd_val, 
+ .make_pmd = paravirt_native_make_pmd, #if PAGETABLE_LEVELS == 4 - .pud_val = native_pud_val, - .make_pud = native_make_pud, + .pud_val = paravirt_native_pud_val, + .make_pud = paravirt_native_make_pud, .set_pgd = native_set_pgd, #endif #endif /* PAGETABLE_LEVELS >= 3 */ - .pte_val = native_pte_val, - .pgd_val = native_pgd_val, + .pte_val = paravirt_native_pte_val, + .pgd_val = paravirt_native_pgd_val, - .make_pte = native_make_pte, - .make_pgd = native_make_pgd, + .make_pte = paravirt_native_make_pte, + .make_pgd = paravirt_native_make_pgd, .dup_mmap = paravirt_nop, .exit_mmap = paravirt_nop, =================================================================== --- a/arch/x86/kernel/paravirt_patch_32.c +++ b/arch/x86/kernel/paravirt_patch_32.c @@ -12,6 +12,18 @@ DEF_NATIVE(pv_cpu_ops, clts, "clts"); DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc"); +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + /* arg in %eax, return in %eax */ + return 0; +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + /* arg in %edx:%eax, return in %edx:%eax */ + return 0; +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { =================================================================== --- a/arch/x86/kernel/paravirt_patch_64.c +++ b/arch/x86/kernel/paravirt_patch_64.c @@ -19,6 +19,21 @@ DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl"); DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs"); +DEF_NATIVE(, mov32, "mov %edi, %eax"); +DEF_NATIVE(, mov64, "mov %rdi, %rax"); + +unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov32, end__mov32); +} + +unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len) +{ + return paravirt_patch_insns(insnbuf, len, + start__mov64, end__mov64); +} + unsigned native_patch(u8 type, u16 clobbers, void *ibuf, unsigned long addr, unsigned len) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 11:26 ` Ingo Molnar 2009-01-20 12:34 ` Nick Piggin 2009-01-20 14:03 ` Ingo Molnar @ 2009-01-20 19:05 ` Zachary Amsden 2009-01-20 19:31 ` Ingo Molnar 2009-01-22 22:26 ` Jeremy Fitzhardinge 3 siblings, 1 reply; 32+ messages in thread From: Zachary Amsden @ 2009-01-20 19:05 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote: > Jeremy, any ideas where this slowdown comes from and how it could be > fixed? Well I'm early responding to this thread before reading on, but I looked at the generated assembly for some common mm paths and it looked awful. The biggest loser was probably having functions to convert pte_t back and forth to pteval_t, which makes most potential mask / shift optimizations impossible - indeed, because the compiler doesn't even understand pte_val(X) = Y is static over the lifetime of the function, it often calls these same conversions back and forth several times, and because this is often done inside hidden macros, it's not even possible to save a cached value in most places. The bulk of state required to keep this extra conversion around ties up a lot of registers and as a result heavily limits potential further optimizations. The code did not look more branchy to me, however, and gcc seemed to do a good job with lining up a nice branch structure in the few paths I looked at. Zach ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT 2009-01-20 19:05 ` Zachary Amsden @ 2009-01-20 19:31 ` Ingo Molnar 0 siblings, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-01-20 19:31 UTC (permalink / raw) To: Zachary Amsden Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa@zytor.com, jeremy@xensource.com, chrisw@sous-sol.org, rusty@rustcorp.com.au * Zachary Amsden <zach@vmware.com> wrote: > On Tue, 2009-01-20 at 03:26 -0800, Ingo Molnar wrote: > > > Jeremy, any ideas where this slowdown comes from and how it could be > > fixed? > > Well I'm early responding to this thread before reading on, but I looked > at the generated assembly for some common mm paths and it looked awful. > The biggest loser was probably having functions to convert pte_t back > and forth to pteval_t, which makes most potential mask / shift > optimizations impossible - indeed, because the compiler doesn't even > understand pte_val(X) = Y is static over the lifetime of the function, > it often calls these same conversions back and forth several times, and > because this is often done inside hidden macros, it's not even possible > to save a cached value in most places. > > The bulk of state required to keep this extra conversion around ties up > a lot of registers and as a result heavily limits potential further > optimizations. > > The code did not look more branchy to me, however, and gcc seemed to do > a good job with lining up a nice branch structure in the few paths I > looked at. 
i've extended my mmap test with branch execution hw-perfcounter stats:

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |  x86-defconfig  |     PARAVIRT=y
 |------------------------------------------------------------------
 |
 |    1311.554526  |    1360.624932  task clock ticks (msecs)  +3.74%
 |
 |              1  |              1  CPU migrations
 |             91  |             79  context switches
 |          55945  |          55943  pagefaults
 | ............................................
 |     3781392474  |     3918777174  CPU cycles          +3.63%
 |     1957153827  |     2161280486  instructions       +10.43%
 |       50234816  |       51303520  cache references    +2.12%
 |        5428258  |        5583728  cache misses        +2.86%
 |
 |      437983499  |      478967061  branches            +9.36%
 |       32486067  |       32336874  branch-misses       -0.46%
 |
 |    1314.782469  |    1363.694447  time elapsed (msecs)  +3.72%
 |
 -----------------------------------

So we execute 9.36% more branches - i.e. very noticeably higher as well.
The CPU predicts them slightly more effectively though: the -0.46% for
branch-misses is well above measurement noise (of ~0.02% for the branch
metric), so it's a systematic effect. Non-functional 'boring' bloat
tends to be easier to predict, so it's not necessarily a real surprise.

That also explains why, despite +10.43% more instructions, the total
cycle count went up by a comparatively smaller +3.63%.

[ that's 64-bit x86 btw. ]

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-20 11:26 ` Ingo Molnar
  ` (2 preceding siblings ...)
  2009-01-20 19:05 ` Zachary Amsden
@ 2009-01-22 22:26 ` Jeremy Fitzhardinge
  2009-01-22 23:04 ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread
From: Jeremy Fitzhardinge @ 2009-01-22 22:26 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty

Ingo Molnar wrote:
> Ouch, that looks unacceptably expensive. All the major distros turn
> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
> promise to have no measurable runtime overhead.
>
> ( And i suspect the real life mmap cost is probably even more expensive,
>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>   any real $cache overhead. )
>
> Jeremy, any ideas where this slowdown comes from and how it could be
> fixed?

I just posted a couple of patches to pick some low-hanging fruit. It
turns out that we don't need to do any pvops calls to do pte flag
manipulations. I'd be interested to see how much of a difference it
makes (it reduces the static code size by a few k).

    J
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-22 22:26 ` Jeremy Fitzhardinge
@ 2009-01-22 23:04 ` Ingo Molnar
  2009-01-22 23:30 ` Zachary Amsden
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2009-01-22 23:04 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds, hpa,
    jeremy, chrisw, zach, rusty

* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>>
>> ( And i suspect the real life mmap cost is probably even more expensive,
>>   as on a Barcelona all of lmbench fits into the cache hence we dont see
>>   any real $cache overhead. )
>>
>> Jeremy, any ideas where this slowdown comes from and how it could be
>> fixed?
>
> I just posted a couple of patches to pick some low-hanging fruit. It
> turns out that we don't need to do any pvops calls to do pte flag
> manipulations. I'd be interested to see how much of a difference it
> makes (it reduces the static code size by a few k).

I've tried your patches - but can see no significant reduction in
overhead. I've updated my table with numbers from your patches:

 -----------------------------------------------
 | Performance counter stats for './mmap-perf' |
 -----------------------------------------------
 |
 |   defconfig   |  PARAVIRT=y  |     +Jeremy
 |-----------------------------------------------------------------------
 |
 |   1311.55452  |  1360.62493  |  1378.94464  task clock (msecs)  +3.74%
 |
 |            1  |           1  |           0  CPU migrations
 |           91  |          79  |          77  context switches
 |        55945  |       55943  |       55980  pagefaults
 |.......................................................................
 |   3781392474  |  3918777174  |  3907189795  CPU cycles          +3.63%
 |   1957153827  |  2161280486  |  2161741689  instructions       +10.43%
 |     50234816  |    51303520  |    50619593  cache references    +2.12%
 |      5428258  |     5583728  |     5575808  cache misses        +2.86%
 |
 |    437983499  |   478967061  |   479053595  branches            +9.36%
 |     32486067  |    32336874  |    32377710  branch-misses       -0.46%
 |
 |   1314.78246  |  1363.69444  |  1357.58161  time elapsed (msecs)  +3.72%
 |
 ------------------------------------------------------------------------

'+Jeremy' is a CONFIG_PARAVIRT=y run done with your patches.

The most stable count is the instruction count:

 |   1957153827  |  2161280486  |  2161741689  instructions       +10.43%

But your two patches did not reduce the instruction count in any
measurable way.

In any case, it is rather inefficient of me proxy-testing your patches;
you can do these measurements yourself too on any Core2 or later Intel
CPU, by running tip/master plus picking up these two utilities:

  http://people.redhat.com/mingo/perfcounters/perfstat.c
  http://redhat.com/~mingo/misc/mmap-perf.c

building them and running this (as root):

  taskset 1 ./perfstat ./mmap-perf 1

it will give you numbers like the ones above.

	Ingo
* Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
  2009-01-22 23:04 ` Ingo Molnar
@ 2009-01-22 23:30 ` Zachary Amsden
  0 siblings, 0 replies; 32+ messages in thread
From: Zachary Amsden @ 2009-01-22 23:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Nick Piggin, Linux Kernel Mailing List,
    Linus Torvalds, hpa@zytor.com, jeremy@xensource.com,
    chrisw@sous-sol.org, rusty@rustcorp.com.au

[-- Attachment #1: Type: text/plain, Size: 650 bytes --]

On Thu, 2009-01-22 at 15:04 -0800, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> In any case, it is rather inefficient of me proxy-testing your patches,
> you can do these measurements yourself too on any Core2 or later Intel
> CPU, by running tip/master plus picking up these two utilities:

Eek, I have no time to spend on this right now, but if anyone is curious
to run this patch (which heavily breaks Xen), I suspect it will cure
most of the performance ailments.

Back when we did the VMI prototyping, we never saw any significant
benchmark reductions until the introduction of the M-to-P conversion
functions.
Zach

[-- Attachment #1.2: paravirt-drop-mpn-ops.patch --]
[-- Type: text/x-patch, Size: 4338 bytes --]

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index e9873a2..e5b39db 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -155,10 +155,6 @@ static inline pteval_t native_pte_flags(pte_t pte)
 #define pgprot_val(x)	((x).pgprot)
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else  /* !CONFIG_PARAVIRT */
-
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)
 
@@ -176,6 +172,8 @@ static inline pteval_t native_pte_flags(pte_t pte)
 #define pte_flags(x)	native_pte_flags(x)
 #define __pte(x)	native_make_pte(x)
 
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt.h>
 #endif	/* CONFIG_PARAVIRT */
 
 #define __pa(x)		__phys_addr((unsigned long)(x))
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ba3e2ff..7de3169 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -278,13 +278,6 @@ struct pv_mmu_ops {
 	void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, pte_t pte);
 
-	pteval_t (*pte_val)(pte_t);
-	pteval_t (*pte_flags)(pte_t);
-	pte_t (*make_pte)(pteval_t pte);
-
-	pgdval_t (*pgd_val)(pgd_t);
-	pgd_t (*make_pgd)(pgdval_t pgd);
-
 #if PAGETABLE_LEVELS >= 3
 #ifdef CONFIG_X86_PAE
 	void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -298,13 +291,7 @@ struct pv_mmu_ops {
 
 	void (*set_pud)(pud_t *pudp, pud_t pudval);
 
-	pmdval_t (*pmd_val)(pmd_t);
-	pmd_t (*make_pmd)(pmdval_t pmd);
-
 #if PAGETABLE_LEVELS == 4
-	pudval_t (*pud_val)(pud_t);
-	pud_t (*make_pud)(pudval_t pud);
-
 	void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
 #endif	/* PAGETABLE_LEVELS == 4 */
 #endif	/* PAGETABLE_LEVELS >= 3 */
@@ -1054,81 +1041,6 @@ static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
-static inline pte_t __pte(pteval_t val)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t,
-				 pv_mmu_ops.make_pte,
-				 val, (u64)val >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t,
-				 pv_mmu_ops.make_pte,
-				 val);
-
-	return (pte_t) { .pte = ret };
-}
-
-static inline pteval_t pte_val(pte_t pte)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_val,
-				 pte.pte, (u64)pte.pte >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_val,
-				 pte.pte);
-
-	return ret;
-}
-
-static inline pteval_t pte_flags(pte_t pte)
-{
-	pteval_t ret;
-
-	if (sizeof(pteval_t) > sizeof(long))
-		ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_flags,
-				 pte.pte, (u64)pte.pte >> 32);
-	else
-		ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_flags,
-				 pte.pte);
-
-#ifdef CONFIG_PARAVIRT_DEBUG
-	BUG_ON(ret & PTE_PFN_MASK);
-#endif
-	return ret;
-}
-
-static inline pgd_t __pgd(pgdval_t val)
-{
-	pgdval_t ret;
-
-	if (sizeof(pgdval_t) > sizeof(long))
-		ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.make_pgd,
-				 val, (u64)val >> 32);
-	else
-		ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.make_pgd,
-				 val);
-
-	return (pgd_t) { ret };
-}
-
-static inline pgdval_t pgd_val(pgd_t pgd)
-{
-	pgdval_t ret;
-
-	if (sizeof(pgdval_t) > sizeof(long))
-		ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.pgd_val,
-				 pgd.pgd, (u64)pgd.pgd >> 32);
-	else
-		ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.pgd_val,
-				 pgd.pgd);
-
-	return ret;
-}
-
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
 static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr,
 					   pte_t *ptep)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index e4c8fb6..ac48a2d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -424,23 +424,12 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.pmd_clear = native_pmd_clear,
 #endif
 	.set_pud = native_set_pud,
-	.pmd_val = native_pmd_val,
-	.make_pmd = native_make_pmd,
 
 #if PAGETABLE_LEVELS == 4
-	.pud_val = native_pud_val,
-	.make_pud = native_make_pud,
 	.set_pgd = native_set_pgd,
 #endif
 #endif	/* PAGETABLE_LEVELS >= 3 */
 
-	.pte_val = native_pte_val,
-	.pte_flags = native_pte_flags,
-	.pgd_val = native_pgd_val,
-
-	.make_pte = native_make_pte,
-	.make_pgd = native_make_pgd,
-
 	.dup_mmap = paravirt_nop,
 	.exit_mmap = paravirt_nop,
 	.activate_mm = paravirt_nop,
end of thread, other threads:[~2009-01-27 10:17 UTC | newest]

Thread overview: 32+ messages -- links below jump to the message on this page --
2009-01-20 11:05 lmbench lat_mmap slowdown with CONFIG_PARAVIRT Nick Piggin
2009-01-20 11:26 ` Ingo Molnar
2009-01-20 12:34 ` Nick Piggin
2009-01-20 12:45 ` Ingo Molnar
2009-01-20 13:41 ` Nick Piggin
2009-01-20 14:03 ` Ingo Molnar
2009-01-20 14:14 ` Nick Piggin
2009-01-20 14:17 ` Ingo Molnar
2009-01-20 14:41 ` Nick Piggin
2009-01-20 15:00 ` Ingo Molnar
2009-01-20 15:13 ` Ingo Molnar
2009-01-20 19:37 ` Ingo Molnar
2009-01-20 20:45 ` Jeremy Fitzhardinge
2009-01-20 20:56 ` Ingo Molnar
2009-01-21  7:27 ` Nick Piggin
2009-01-21 22:23 ` Jeremy Fitzhardinge
2009-01-22 22:28 ` Zachary Amsden
2009-01-22 22:44 ` Jeremy Fitzhardinge
2009-01-22 22:49 ` H. Peter Anvin
2009-01-22 22:58 ` Zachary Amsden
2009-01-22 23:52 ` H. Peter Anvin
2009-01-23  0:08 ` Jeremy Fitzhardinge
2009-01-22 22:55 ` Zachary Amsden
2009-01-23  0:14 ` Jeremy Fitzhardinge
2009-01-27  7:59 ` Ingo Molnar
2009-01-27  8:24 ` Jeremy Fitzhardinge
2009-01-27 10:17 ` Jeremy Fitzhardinge
2009-01-20 19:05 ` Zachary Amsden
2009-01-20 19:31 ` Ingo Molnar
2009-01-22 22:26 ` Jeremy Fitzhardinge
2009-01-22 23:04 ` Ingo Molnar
2009-01-22 23:30 ` Zachary Amsden