* (no subject)
@ 2011-07-22 0:32 Jason Baron
2011-07-22 0:57 ` Paul Turner
0 siblings, 1 reply; 13+ messages in thread
From: Jason Baron @ 2011-07-22 0:32 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov
rth@redhat.com
Bcc:
Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
when bandwidth control is inactive
Reply-To:
In-Reply-To: <20110721184758.403388616@google.com>
On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> So I'm seeing some strange costs associated with jump_labels; while on paper
> the branches and instructions retired improve (as expected), we're taking an
> unexpected hit in IPC.
>
> [From the initial mail we have workloads:
> mkdir -p /cgroup/cpu/test
> echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> (W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> (W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> ]
>
> To make some of the figures more clear:
>
> Legend:
> !BWC = tip + bwc, BWC compiled out
> BWC = tip + bwc
> BWC_JL = tip + bwc + jump label (this patch)
>
>
> Now, comparing under W1 we see:
> W1: BWC vs BWC_JL
> instructions cycles branches elapsed
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC] 845934117 974222228 152715407 0.419014188 [baseline]
> +unconstrained 857963815 (+1.42) 1007152750 (+3.38) 153140328 (+0.28) 0.433186926 (+3.38) [rel]
> +10000000000/1000: 876937753 (+2.55) 1033978705 (+5.65) 160038434 (+3.59) 0.443638365 (+5.66) [rel]
> +10000000000/1000000: 880276838 (+3.08) 1036176245 (+6.13) 160683878 (+4.15) 0.444577244 (+6.14) [rel]
>
> barcelona [BWC] 820573353 748178486 148161233 0.342122850 [baseline]
> +unconstrained 817011602 (-0.43) 759838181 (+1.56) 145951513 (-1.49) 0.347462571 (+1.56) [rel]
> +10000000000/1000: 830109086 (+0.26) 770451537 (+1.67) 151228902 (+1.08) 0.350824677 (+1.65) [rel]
> +10000000000/1000000: 830196206 (+0.30) 770704213 (+2.27) 151250413 (+1.12) 0.350962182 (+2.28) [rel]
>
> westmere [BWC] 802533191 694415157 146071233 0.194428018 [baseline]
> +unconstrained 799057936 (-0.43) 751384496 (+8.20) 143875513 (-1.50) 0.211182620 (+8.62) [rel]
> +10000000000/1000: 812033785 (+0.27) 761469084 (+8.51) 149134146 (+1.09) 0.212149229 (+8.28) [rel]
> +10000000000/1000000: 811912834 (+0.27) 757842988 (+7.45) 149113291 (+1.09) 0.211364804 (+7.30) [rel]
> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> the unconstrained case under BWC_JL (relative to BWC).
>
>
> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
> measurements for BWC_JL, with (%d) being the relative difference to their
> BWC counterparts.
>
> W2: BWC vs BWC_JL is very similar.
> BWC vs BWC_JL
> clovertown [BWC] 985732031 1283113452 175621212 1.375905653
> +unconstrained 979242938 (-0.66) 1288971141 (+0.46) 172122546 (-1.99) 1.389795165 (+1.01) [rel]
> +10000000000/1000: 999886468 (+0.33) 1296597143 (+1.13) 180554004 (+1.62) 1.392576770 (+1.18) [rel]
> +10000000000/1000000: 999034223 (+0.11) 1293925500 (+0.57) 180413829 (+1.39) 1.391041338 (+0.94) [rel]
>
> barcelona [BWC] 982139920 1078757792 175417574 1.069537049
> +unconstrained 965443672 (-1.70) 1075377223 (-0.31) 170215844 (-2.97) 1.045595065 (-2.24) [rel]
> +10000000000/1000: 989104943 (+0.05) 1100836668 (+0.52) 178837754 (+1.22) 1.058730316 (-1.77) [rel]
> +10000000000/1000000: 987627489 (-0.32) 1095843758 (-0.17) 178567411 (+0.84) 1.056100899 (-2.28) [rel]
>
> westmere [BWC] 918633403 896047900 166496917 0.754629182
> +unconstrained 914740541 (-0.42) 903906801 (+0.88) 163652848 (-1.71) 0.758050332 (+0.45) [rel]
> +10000000000/1000: 927517377 (-0.41) 952579771 (+5.67) 170173060 (+0.75) 0.771193786 (+2.43) [rel]
> +10000000000/1000000: 914676985 (-0.89) 936106277 (+3.81) 167683288 (+0.22) 0.764973632 (+1.38) [rel]
>
> Now this is rather odd: almost across the board we're seeing the expected
> drops in instructions and branches, yet we appear to be paying a heavy IPC
> price. The fact that wall-time has scaled equivalently with cycles roughly
> rules out the cycles counter being off.
>
> We are seeing the expected behavior in the bandwidth enabled case;
> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> and instruction which shows up on all the numbers above.
>
> With respect to compiler mangling, the text is essentially unchanged in size.
> One lurking suspicion is whether the inserted nops have perturbed some of the
> jmp/branch alignments?
>
> text data bss dec hex filename
> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
>
> I have checked to make sure that the right instructions are being patched in
> at run-time. I've also pulled a fully patched jump_label out of the kernel
> into a userspace test (and benchmarked it directly under perf). The results
> here are also exactly as expected.
>
> e.g.
> Performance counter stats for './jump_test':
> 1,500,839,002 instructions, 300,147,081 branches, 702,468,404 cycles
> Performance counter stats for './jump_test 1':
> 2,001,014,609 instructions, 400,177,192 branches, 901,758,219 cycles
>
> Overall if we can fix the IPC the benefit in the globally unconstrained case
> looks really good.
>
> Any thoughts Jason?
>
Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
more optimal.
thanks,
-Jason
^ permalink raw reply [flat|nested] 13+ messages in thread
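[ For reference, a minimal userspace sketch of the kind of nop-vs-jmp
micro-benchmark behind the './jump_test' numbers above might look like the
following; the program name, iteration count, and the particular 5-byte
encodings are illustrative assumptions, not the exact test run in this
thread. The "disabled" case executes a 5-byte nop, the "enabled" case a
5-byte jmp with a zero displacement (a jump to the next instruction), each
in a tight loop timed under perf stat: ]

/* jump_test.c -- sketch only; build with -O2 and run under "perf stat". */
#include <stdio.h>

#define NOP5 ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"  /* nopl 0x0(%rax,%rax,1) */
#define JMP5 ".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t"  /* jmp .+5: falls through */

int main(int argc, char **argv)
{
	long i, iters = 300000000;      /* ~300M, roughly matching the counts above */
	int enabled = argc > 1;         /* "./jump_test 1" selects the jmp variant */
	volatile long sink = 0;

	if (enabled) {
		for (i = 0; i < iters; i++) {
			asm volatile(JMP5 ::: "memory");   /* stand-in for the enabled label */
			sink++;
		}
	} else {
		for (i = 0; i < iters; i++) {
			asm volatile(NOP5 ::: "memory");   /* stand-in for the disabled label */
			sink++;
		}
	}

	printf("%ld iterations (%s)\n", sink, enabled ? "jmp" : "nop");
	return 0;
}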
* Re:
2011-07-22 0:32 Jason Baron
@ 2011-07-22 0:57 ` Paul Turner
2011-07-22 1:17 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-07-22 0:57 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> rth@redhat.com
> Bcc:
> Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> when bandwidth control is inactive
> Reply-To:
> In-Reply-To: <20110721184758.403388616@google.com>
>
> On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> [... original report and measurement tables trimmed; quoted in full above ...]
>>
>> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> looks really good.
>>
>> Any thoughts Jason?
>>
>
> Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> more optimal.
>

Ah I should have mentioned that was one of the holes I stared down:

Builds were -O2 (gcc-4.6.1) and
$ zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set

Same kernel image across all platforms.

> thanks,
>
> -Jason
>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 0:57 ` Paul Turner
@ 2011-07-22 1:17 ` Jason Baron
2011-07-22 1:38 ` Paul Turner
0 siblings, 1 reply; 13+ messages in thread
From: Jason Baron @ 2011-07-22 1:17 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> > [... original report and measurement tables trimmed; quoted in full above ...]
> >>
> >> Now this is rather odd: almost across the board we're seeing the expected
> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> price. The fact that wall-time has scaled equivalently with cycles roughly
> >> rules out the cycles counter being off.

If I understand your results, for barcelona you did see an improvement
in cycles and elapsed time with jump labels for unconstrained?

> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> and instruction which shows up on all the numbers above.
> >>
> >> With respect to compiler mangling, the text is essentially unchanged in size.
> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> jmp/branch alignments?

hmmmm....not sure; I'm adding Richard Henderson to the cc list, who
worked on the 'asm goto' support in gcc.

> >>    text       data        bss        dec        hex  filename
> >> 7277206    2827256    2125824   12230286     ba9e8e  vmlinux.jump_label
> >> 7276886    2826744    2125824   12229454     ba9b4e  vmlinux.no_jump_label

The other thing here is that vmlinux.jump_label includes the extra
kernel/jump_label.o file, so you can sort of subtract the text size of
that file to do a fair comparison.

Also, I would have expected the data section to have increased more with
jump labels enabled. Are tracepoints disabled (a current user of jump
labels)?

> >> I have checked to make sure that the right instructions are being patched in
> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
> >> into a userspace test (and benchmarked it directly under perf). The results
> >> here are also exactly as expected.
> >>
> >> e.g.
> >> Performance counter stats for './jump_test':
> >> 1,500,839,002 instructions, 300,147,081 branches, 702,468,404 cycles
> >> Performance counter stats for './jump_test 1':
> >> 2,001,014,609 instructions, 400,177,192 branches, 901,758,219 cycles

What no-op did you use in userspace? I wouldn't think the no-op choice
would make any difference though... At compile time we use a 'jmp 0', and
then at boot we dynamically patch the 'jmp 0' with the no-op we think works
best...

thanks,

-Jason

> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> looks really good.
> >>
> >> Any thoughts Jason?
> >>
> > [... remainder of the CONFIG_CC_OPTIMIZE_FOR_SIZE exchange quoted above ...]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 1:17 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
@ 2011-07-22 1:38 ` Paul Turner
2011-07-27 21:58 ` Jason Baron
0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-07-22 1:38 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> >> [... original report and measurement tables trimmed; quoted in full above ...]
>
> If I understand your results, for barcelona you did see an improvement
> in cycles and elapsed time with jump labels for unconstrained?
>

Under W2, yes.

> >> >> With respect to compiler mangling, the text is essentially unchanged in size.
> >> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> >> jmp/branch alignments?
>
> hmmmm....not sure; I'm adding Richard Henderson to the cc list, who
> worked on the 'asm goto' support in gcc.
>
> >> >>    text       data        bss        dec        hex  filename
> >> >> 7277206    2827256    2125824   12230286     ba9e8e  vmlinux.jump_label
> >> >> 7276886    2826744    2125824   12229454     ba9b4e  vmlinux.no_jump_label
>
> The other thing here is that vmlinux.jump_label includes the extra
> kernel/jump_label.o file, so you can sort of subtract the text size of
> that file to do a fair comparison.

Even without doing that it's only a factor-of-1.00004 (~0.004%) change in
text size.

I was just making the inference that if it's gcc mangling it's likely
in the layout/alignment.

>
> Also, I would have expected the data section to have increased more with
> jump labels enabled. Are tracepoints disabled (a current user of jump
> labels)?

Yeah -- Tracing is enabled so the BWC build should have labels
already; this likely accounts for the small increase noted above.

> >> >> I have checked to make sure that the right instructions are being patched in
> >> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
> >> >> into a userspace test (and benchmarked it directly under perf). The results
> >> >> here are also exactly as expected.
>
> What no-op did you use in userspace? I wouldn't think the no-op choice
> would make any difference though... At compile time we use a 'jmp 0', and
> then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> best...
>

Sorry -- what I meant here is I pulled the run-time chosen "best" nop
out of /proc/kcore and tested a tight loop about a <JL><RET><COND><RET>
sequence (e.g. cfs_rq_throttled()) with JL being the nop and jmp
respectively.

Specifically for Westmere this ends up being K8_NOP5 -- 0x666666D0

> thanks,
>
> -Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread
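[ For context, the <JL><RET><COND><RET> sequence described above is the sort
of guard the series places in the scheduler fast paths. A rough sketch of the
pattern follows; the key name and exact form here are assumptions (loosely
modeled on the bandwidth-control code that later landed in mainline), not a
quote of the patch under discussion. While no group has a quota set,
static_branch() is patched to a single nop and the throttled check is skipped
entirely: ]

static struct jump_label_key __cfs_bandwidth_used;

static inline int cfs_bandwidth_used(void)
{
	return static_branch(&__cfs_bandwidth_used);
}

static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
	if (!cfs_bandwidth_used())      /* <JL>: nop while bandwidth control is unused */
		return 0;               /* <RET> */

	return cfs_rq->throttled;       /* <COND><RET> */
}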
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 1:38 ` Paul Turner
@ 2011-07-27 21:58 ` Jason Baron
2011-08-05 3:53 ` Paul Turner
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Jason Baron @ 2011-07-27 21:58 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Thu, Jul 21, 2011 at 06:38:01PM -0700, Paul Turner wrote:
> [... earlier discussion trimmed; quoted in full above ...]

Hi Paul,

Ok, I think I finally tracked this down. It may seem a bit crazy, but
when we are getting down to cycle counting like this, it seems that the
link order in the kernel/Makefile can make a difference. I had
jump_label.o listed after the core files, whereas all the code in
jump_label.o is really slow-path code (used when toggling branch
values). As follows:

--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
-	    async.o range.o jump_label.o
+	    async.o range.o
 obj-y += groups.o

 ifdef CONFIG_FUNCTION_TRACER
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_JUMP_LABEL) += jump_label.o

 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is

I've tested the patch using a single 'static_branch()' in the getppid()
path, and basically running tight loops of calls to getppid(). Before the
patch, I was seeing results similar to what you reported; after the
patch, things improved for all metrics. Here are my results for the
branch disabled case:

With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     3,969,510,217 instructions             #    0.864 IPC     ( +-  0.000% )
     4,592,334,954 cycles                   ( +-  0.046% )
       751,634,470 branches                 ( +-  0.000% )

        1.722635797 seconds time elapsed    ( +-  0.046% )

Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     4,009,611,846 instructions             #    0.867 IPC     ( +-  0.000% )
     4,622,210,580 cycles                   ( +-  0.012% )
       771,662,904 branches                 ( +-  0.000% )

        1.734341454 seconds time elapsed    ( +-  0.022% )

So all of the measured metrics improved in the jump labels case by
between 0.5% and 2.5%.

I'm curious to see what you find with this patch.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread
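[ A sketch of the kind of test described above; the key name and its
placement in sys_getppid() are assumptions, not Jason's actual test patch.
The idea is that a disabled jump label sits directly in the getppid() path,
so a userspace loop of getppid() calls timed under "perf stat --repeat 50"
exercises the patched-in nop on every syscall: ]

/* kernel/sys.c -- sketch: a disabled static_branch() in the getppid() path. */
#include <linux/jump_label.h>

static struct jump_label_key jl_test_key;       /* zero-initialized: branch disabled */

SYSCALL_DEFINE0(getppid)
{
	int pid;

	if (static_branch(&jl_test_key))        /* a single nop while the key is disabled */
		printk_once(KERN_INFO "jump label test branch taken\n");

	rcu_read_lock();
	pid = task_tgid_vnr(rcu_dereference(current->real_parent));
	rcu_read_unlock();

	return pid;
}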
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
@ 2011-08-05 3:53 ` Paul Turner
2011-08-05 7:21 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-08-05 3:53 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

< snip >

> Hi Paul,
>
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make a difference.
>
> [... Makefile patch and getppid() results trimmed; quoted in full above ...]
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>

Hi Jason,

Thanks for taking a look at this. Sorry, this took a few days to benchmark
all the permutations and we had some issues with internal proxies which
interrupted benchmarking runs.

Results and some analysis follow.

[ Key:
  npo_XXX  = with CONFIG_JUMP_LABEL, without link order patch (no patched order)
  po_XXX   = with CONFIG_JUMP_LABEL, with link order patch (patched order)
  nojl_XXX = without CONFIG_JUMP_LABEL

  Where "XXX" is
    head:   tip (c5bafb3) without patch series
    cfs:    tip + patch series - jump_label patch
    cfs_jl: tip + patch series + jump_label for unconstrained

  Test was repeated 3 times, each run was 50 repeats w/ typically ~<0.1
  in-test variance on reported output ]

Considering just jump labels in tip, comparing against HEAD w/ !CONFIG_JUMP_LABEL:

                 instructions           cycles                 branches               elapsed
---------------------------------------------------------------------------------------------------------------------
Westmere:
njl_head.1       798832892              722624737              145375836              0.203218936 [baseline]
njl_head.2       798888783 (+0.01)      746118188 (+3.25)      145386807 (+0.01)      0.208573683 (-2.18)
njl_head.3       798864253 (+0.00)      731537139 (+1.23)      145382747 (+0.00)      0.204098175 (-4.28)
npo_head.1       797033521 (-0.23)      731239359 (+1.19)      144571358 (-0.55)      0.206910496 (-2.96)
npo_head.2       797166434 (-0.21)      728926020 (+0.87)      144603465 (-0.53)      0.202906392 (-4.84)
npo_head.3       797165370 (-0.21)      725930458 (+0.46)      144603438 (-0.53)      0.202118274 (-5.21)
po_head.1        797019904 (-0.23)      699008145 (-3.27)      144567652 (-0.56)      0.197272615 (-7.48)
po_head.2        797037682 (-0.22)      705732419 (-2.34)      144572115 (-0.55)      0.197101692 (-7.56)
po_head.3        797079804 (-0.22)      698007668 (-3.41)      144580964 (-0.55)      0.194871253 (-8.61)

Barcelona:
njl_head.1       816842028              748362637              147462095              0.341654152
njl_head.2       816849735 (+0.00)      748480742 (+0.02)      147462652 (+0.00)      0.341450734 (-2.90)
njl_head.3       816834963 (-0.00)      747083797 (-0.17)      147460200 (-0.00)      0.340802353 (-3.09)
npo_head.1       815068563 (-0.22)      775012690 (+3.56)      146661357 (-0.54)      0.353797321 (+0.61)
npo_head.2       815033261 (-0.22)      759613364 (+1.50)      146654106 (-0.55)      0.346462671 (-1.48)
npo_head.3       815029611 (-0.22)      762660196 (+1.91)      146654169 (-0.55)      0.347565129 (-1.16)
po_head.1        815026489 (-0.22)      767229109 (+2.52)      146653376 (-0.55)      0.350241833 (-0.40)
po_head.2        815035127 (-0.22)      770224495 (+2.92)      146654019 (-0.55)      0.351352092 (-0.09)
po_head.3        815109904 (-0.21)      774954096 (+3.55)      146662020 (-0.54)      0.353505054 (+0.53)

With the patch to fix link-order we're typically faster, and it's probably
time to modulate the configs so we get CONFIG_JUMP_LABEL by default when
CC_HAS_ASM_GOTO.

Considering bandwidth control, comparing vs HEAD w/ CONFIG_JUMP_LABEL:

                 instructions           cycles                 branches               elapsed
---------------------------------------------------------------------------------------------------------------------
Westmere:
po_head.1        797019904              699008145              144567652              0.197272615 [Baseline]
po_head.2        797037682 (+0.00)      705732419 (+0.96)      144572115 (+0.00)      0.197101692 (-4.91)
po_head.3        797079804 (+0.01)      698007668 (-0.14)      144580964 (+0.01)      0.194871253 (-5.98)
njl_cfs.1        802649718 (+0.71)      708143552 (+1.31)      146577437 (+1.39)      0.198770168 (-4.10)
njl_cfs.2        802679078 (+0.71)      707486608 (+1.21)      146582628 (+1.39)      0.197890812 (-4.53)
njl_cfs.3        802647500 (+0.71)      704770712 (+0.82)      146578141 (+1.39)      0.196742304 (-5.08)
npo_cfs.1        800661523 (+0.46)      724068093 (+3.59)      145774786 (+0.83)      0.204632700 (-1.27)
npo_cfs.2        800646997 (+0.46)      718884486 (+2.84)      145772293 (+0.83)      0.201248482 (-2.91)
npo_cfs.3        800783171 (+0.47)      725140326 (+3.74)      145804350 (+0.86)      0.203266025 (-1.93)
npo_cfs_jl.1     797304605 (+0.04)      687741762 (-1.61)      143666256 (-0.62)      0.194302293 (-6.26)
npo_cfs_jl.2     797446281 (+0.05)      694066715 (-0.71)      143700065 (-0.60)      0.194212118 (-6.30)
npo_cfs_jl.3     797374495 (+0.04)      697561774 (-0.21)      143682692 (-0.61)      0.194935111 (-5.95)
po_cfs.1         800631004 (+0.45)      715819643 (+2.41)      145769677 (+0.83)      0.200007036 (-3.51)
po_cfs.2         800642622 (+0.45)      698569729 (-0.06)      145769973 (+0.83)      0.194625680 (-6.10)
po_cfs.3         800752778 (+0.47)      707282749 (+1.18)      145798992 (+0.85)      0.197047366 (-4.93)
po_cfs_jl.1      797306617 (+0.04)      686329256 (-1.81)      143666659 (-0.62)      0.193107369 (-6.83)
po_cfs_jl.2      797434478 (+0.05)      677865445 (-3.02)      143697712 (-0.60)      0.189314824 (-8.66)
po_cfs_jl.3      797299055 (+0.04)      686371679 (-1.81)      143665758 (-0.62)      0.191859014 (-7.44)

Barcelona:
po_head.1        815026489              767229109              146653376              0.350241833 [Baseline]
po_head.2        815035127 (+0.00)      770224495 (+0.39)      146654019 (+0.00)      0.351352092 (-2.47)
po_head.3        815109904 (+0.01)      774954096 (+1.01)      146662020 (+0.01)      0.353505054 (-1.87)
njl_cfs.1        820647075 (+0.69)      756895773 (-1.35)      148663929 (+1.37)      0.345563962 (-4.07)
njl_cfs.2        820672501 (+0.69)      761520373 (-0.74)      148667815 (+1.37)      0.347529253 (-3.53)
njl_cfs.3        820664350 (+0.69)      763400895 (-0.50)      148666126 (+1.37)      0.348337223 (-3.30)
npo_cfs.1        818629349 (+0.44)      758306455 (-1.16)      147854452 (+0.82)      0.346678486 (-3.77)
npo_cfs.2        818829256 (+0.47)      768393448 (+0.15)      147891099 (+0.84)      0.350678075 (-2.65)
npo_cfs.3        818697806 (+0.45)      772218715 (+0.65)      147866720 (+0.83)      0.352333672 (-2.20)
npo_cfs_jl.1     815343935 (+0.04)      760127157 (-0.93)      145753233 (-0.61)      0.347184970 (-3.62)
npo_cfs_jl.2     815415786 (+0.05)      775772068 (+1.11)      145762961 (-0.61)      0.353965833 (-1.74)
npo_cfs_jl.3     815403187 (+0.05)      764048918 (-0.41)      145761012 (-0.61)      0.348619922 (-3.23)
po_cfs.1         819204964 (+0.51)      767156385 (-0.01)      147959727 (+0.89)      0.350737982 (-2.64)
po_cfs.2         818665676 (+0.45)      764324366 (-0.38)      147860788 (+0.82)      0.348814489 (-3.17)
po_cfs.3         818661849 (+0.45)      752288492 (-1.95)      147859717 (+0.82)      0.343294319 (-4.70)
po_cfs_jl.1      815336908 (+0.04)      765760248 (-0.19)      145755155 (-0.61)      0.349608614 (-2.95)
po_cfs_jl.2      815322295 (+0.04)      765613685 (-0.21)      145751972 (-0.61)      0.349321663 (-3.03)
po_cfs_jl.3      815310833 (+0.03)      759647967 (-0.99)      145750118 (-0.62)      0.346607639 (-3.78)

Thanks to the magic of compiler re-organization we now report zero overhead;
in fact a speed-up is realized.

I will re-post v7.3 with:
- rebase to minor changes in tip
- removing RFT from adding jump_labels to CFS
- additional hierarchical period constraint

Thanks for looking into this Jason!

- Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 3:53 ` Paul Turner
@ 2011-08-05 7:21 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05 7:21 UTC (permalink / raw)
To: Paul Turner
Cc: Jason Baron, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Thu, 2011-08-04 at 20:53 -0700, Paul Turner wrote:
>
> I will re-post v7.3 with:
> - rebase to minor changes in tip
> - removing RFT from adding jump_labels to CFS
> - additional hierarchical period constraint

Could you rebase to -tip + my patches? Most of your previous set is
already queued there.

The reason it's not in -tip is because the merge window fallout still
has -tip in a somewhat shaky state.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
2011-08-05 3:53 ` Paul Turner
@ 2011-08-05 3:55 ` Paul Turner
2011-08-05 18:28 ` Jason Baron
0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-08-05 3:55 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

> [... kernel/Makefile link-order patch trimmed; quoted in full above ...]

Tested-by: Paul Turner <pjt@google.com>

Let me know if you need any result tables for the actual commit msg.

Same goes for making CONFIG_JUMP_LABEL equivalent to default in the
CC_HAS_ASM_GOTO case (at least on x86 anyway).

> [... getppid() test description and results trimmed; quoted in full above ...]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 3:55 ` Paul Turner
@ 2011-08-05 18:28 ` Jason Baron
0 siblings, 0 replies; 13+ messages in thread
From: Jason Baron @ 2011-08-05 18:28 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
rth, rostedt

On Thu, Aug 04, 2011 at 08:55:08PM -0700, Paul Turner wrote:
> > [... kernel/Makefile link-order patch trimmed; quoted in full above ...]
>
> Tested-by: Paul Turner <pjt@google.com>
>
> Let me know if you need any result tables for the actual commit msg.

Hi Paul,

Thanks for taking the time to test this :) I'll post the patch shortly with
my own testing results. Hopefully, it can still be considered for 3.1 b/c of
the non-invasive nature of the patch...

> Same goes for making CONFIG_JUMP_LABEL equivalent to default in the
> CC_HAS_ASM_GOTO case (at least on x86 anyway).
>

I originally had CONFIG_JUMP_LABEL implicitly turned on, but we ran into a
32-bit compiler issue that was causing random, nasty crashes. That issue has
since been resolved in gcc, but we might need to update the CC_HAS_ASM_GOTO
check to deal with that case better. Currently, we're using the
'-maccumulate-outgoing-args' gcc option to work around the issue for 32-bit
x86 (see: arch/x86/Makefile_32.cpu).

With the jump label interface somewhat stabilizing (I say somewhat, b/c Peter
brought up a good use case in the scheduler that it currently doesn't
address, but which we should be able to support without too much churn) and
these testing results, I think it might make sense to consider turning it on
by default for 3.2. Thoughts?

Thanks,

-Jason

> [... getppid() test results trimmed; quoted in full above ...]

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
@ 2011-08-05 8:30 ` Peter Zijlstra
2011-08-05 15:11 ` Richard Henderson
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05 8:30 UTC (permalink / raw)
To: Jason Baron
Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Wed, 2011-07-27 at 17:58 -0400, Jason Baron wrote:
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make a difference. I had
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow-path code (used when toggling branch
> values).
>
> [... kernel/Makefile link-order patch trimmed; quoted in full above ...]

OK, so _WHY_ does that make a difference and will a next version of
gnu-binutils not mess that up?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 8:30 ` Peter Zijlstra
@ 2011-08-05 15:11 ` Richard Henderson
2011-08-05 15:14 ` Peter Zijlstra
2011-08-05 15:24 ` Jason Baron
0 siblings, 2 replies; 13+ messages in thread
From: Richard Henderson @ 2011-08-05 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> OK, so _WHY_ does that make a difference and will a next version of
> gnu-binutils not mess that up?

The Why is micro-architectural, and I can't answer that.

But ld will never re-order the files as given on the command-line.
There are too many functions and tables that are constructed
piece-wise from input sections; re-ordering them would change
the semantics of the program.

r~

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 15:11 ` Richard Henderson
@ 2011-08-05 15:14 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05 15:14 UTC (permalink / raw)
To: Richard Henderson
Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Fri, 2011-08-05 at 08:11 -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
>
> The Why is micro-architectural, and I can't answer that.
>
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.

Right, so I was wondering about things like whole-program-optimization
passes at link time. Since I've no clue why the proposed patch does what
it does, it's hard to say what invariant needs to be kept.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 15:11 ` Richard Henderson
@ 2011-08-05 15:24 ` Jason Baron
1 sibling, 0 replies; 13+ messages in thread
From: Jason Baron @ 2011-08-05 15:24 UTC (permalink / raw)
To: Richard Henderson, a.p.zijlstra
Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Fri, Aug 05, 2011 at 08:11:15AM -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
>
> The Why is micro-architectural, and I can't answer that.

In tracking this down, I eventually found that just having the jump_label.o
file compiled into the kernel, but not actually using static_branch() or
'asm goto' anywhere, led to a performance hit. Thus, the compiler or the
'asm goto' itself wasn't actually causing any degradation. Since the
jump_label.o file is only slow-path code, it can be moved away from core or
heavily called kernel routines. I suspect this is probably an icache issue,
but I can't say for sure.

Thanks,

-Jason

> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.
>
> r~

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2011-08-05 18:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-22  0:32 Jason Baron
2011-07-22  0:57 ` Paul Turner
2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
2011-07-22  1:38     ` Paul Turner
2011-07-27 21:58       ` Jason Baron
2011-08-05  3:53         ` Paul Turner
2011-08-05  7:21           ` Peter Zijlstra
2011-08-05  3:55         ` Paul Turner
2011-08-05 18:28           ` Jason Baron
2011-08-05  8:30         ` Peter Zijlstra
2011-08-05 15:11           ` Richard Henderson
2011-08-05 15:14             ` Peter Zijlstra
2011-08-05 15:24             ` Jason Baron