* [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
@ 2011-07-29 1:38 Dave Chinner
2011-07-29 3:30 ` Andrew Lutomirski
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2011-07-29 1:38 UTC (permalink / raw)
To: linux-kernel; +Cc: luto, mingo
Hi folks,
It's merge window again, which means I'm doing my usual "where did
the XFS performance go" bisects again. The usual workload:
$ time sudo mkfs.xfs -f -d agcount=32 -l size=131072b /dev/vda
meta-data=/dev/vda               isize=256    agcount=32, agsize=134217728 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=4294967296, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
real 0m4.496s
user 0m0.012s
sys 0m0.196s
$ sudo mount -o nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd /home/dave/src/fs_mark-3.3/
$ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
On a 8p 4GB RAM VM and a 16TB filesystem. On 3.0, this completes in:
0 48000000 0 110068.2 7165149
0 48800000 0 112275.1 7523326
0 49600000 0 111725.3 9002077
0 50400000 0 109074.4 7111665
246.67user 2903.42system 8:16.96elapsed 633%CPU (0avgtext+0avgdata 82544maxresident)k
0inputs+0outputs (1639major+2603202minor)pagefaults 0swaps
$
The completion time over multiple runs is ~8m10s +/-5s, and the user
CPU time is roughly 245s +/-5s.
Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
ends up at:
0 48000000 0 108975.2 9507483
0 48800000 0 114676.5 8604471
0 49600000 0 98062.0 8921525
0 50400000 0 103864.7 8218302
287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
and the user CPU time is at 290s +/-5s. So the benchmark is slower to
complete and consumes 20% more CPU in userspace. The following commit
c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
also contributes to the slowdown a bit.
FYI, fs_mark does a lot of gettimeofday() calls - one before and
after every syscall that does filesystem work so it can calculate
the syscall times and the amount of time spent not doing syscalls.
I'm assuming this is the problem based on the commit message.
Issuing hundreds of thousands of gettimeofday() calls per second spread
across multiple CPUs is not uncommon, especially in benchmark or
performance measuring software. If that is the cause, then these
commits add -significant- overhead to that process.
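The timing pattern described above boils down to something like the
following (a hypothetical sketch for illustration only - `tv_delta_us`
and `time_one_op` are made-up names, not fs_mark's actual code):

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two timevals. */
long tv_delta_us(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L
         + (end->tv_usec - start->tv_usec);
}

/* Bracket a single operation with gettimeofday(): two timer calls per
 * measured syscall, so any per-call gettimeofday() overhead shows up
 * directly as user CPU time when this runs hundreds of thousands of
 * times per second. */
long time_one_op(void (*op)(void))
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    op();                       /* the filesystem work being measured */
    gettimeofday(&after, NULL);
    return tv_delta_us(&before, &after);
}
```

With this shape, a regression in gettimeofday() itself is charged to every
single measured operation, which is why a slower vsyscall path shows up so
prominently in the benchmark's user time.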
Assuming this is the problem, can this be fixed without the whole
world having to wait for the current glibc dev tree to filter down
into distro repositories?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 3:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Thu, Jul 28, 2011 at 9:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> Hi folks,
>
> It's merge window again, which means I'm doing my usual "where did
> the XFS performance go" bisects again. The usual workload:

[...]

> The completion time over multiple runs is ~8m10s +/-5s, and the user
> CPU time is roughly 245s +/-5s.
>
> Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
> ends up at:
>
> 0 48000000 0 108975.2 9507483
> 0 48800000 0 114676.5 8604471
> 0 49600000 0 98062.0 8921525
> 0 50400000 0 103864.7 8218302
> 287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
> 0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
>
> Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
> and the user CPU time is at 290s +/-5s. So the benchmark is slower to
> complete and consumes 20% more CPU in userspace. The following commit
> c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
> also contributes to the slowdown a bit.

I'm surprised that the second commit had any effect.

> FYI, fs_mark does a lot of gettimeofday() calls - one before and
> after every syscall that does filesystem work so it can calculate
> the syscall times and the amount of time spent not doing syscalls.
> I'm assuming this is the problem based on the commit message.
> Issuing hundreds of thousands of gettimeofday() calls per second spread
> across multiple CPUs is not uncommon, especially in benchmark or
> performance measuring software. If that is the cause, then these
> commits add -significant- overhead to that process.

I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:

# test_vsyscall bench
Benchmarking syscall gettimeofday ...     7068000 loops in 0.50004s =  70.75 nsec / loop
Benchmarking vdso gettimeofday ...       23868000 loops in 0.50002s =  20.95 nsec / loop
Benchmarking vsyscall gettimeofday ...    2106000 loops in 0.50004s = 237.44 nsec / loop
Benchmarking syscall CLOCK_MONOTONIC ...  9018000 loops in 0.50002s =  55.45 nsec / loop
Benchmarking vdso CLOCK_MONOTONIC ...    30867000 loops in 0.50002s =  16.20 nsec / loop
Benchmarking syscall time ...            12962000 loops in 0.50001s =  38.58 nsec / loop
Benchmarking vdso time ...              286269000 loops in 0.50000s =   1.75 nsec / loop
Benchmarking vsyscall time ...            2412000 loops in 0.50012s = 207.35 nsec / loop
Benchmarking vdso getcpu ...             40265000 loops in 0.50001s =  12.42 nsec / loop
Benchmarking vsyscall getcpu ...          2334000 loops in 0.50012s = 214.27 nsec / loop
Benchmarking dummy syscall ...           14927000 loops in 0.50000s =  33.50 nsec / loop

So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
precise than gettimeofday. IMO you should fix your benchmark :)

More seriously, though, I think it's a decent tradeoff to slow down
some extremely vsyscall-heavy legacy workloads to remove the last bit
of nonrandomized executable code. The only way this should show up to
any significant extent is on modern rdtsc-using systems that make a
huge number of vsyscalls. On older machines, even the cost of the
trap should be smallish compared to the cost of HPET / acpi_pm access.

> Assuming this is the problem, can this be fixed without requiring
> the whole world having to wait for the current glibc dev tree to
> filter down into distro repositories?

How old is your glibc? gettimeofday has used the vdso since:

commit 9c6f6953fda96b49c8510a879304ea4222ea1781
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Mon Aug 13 18:47:42 2007 +0000

    * sysdeps/unix/sysv/linux/x86_64/libc-start.c
    (_libc_vdso_platform_setup): If vDSO is not available point
    __vdso_gettimeofday to the vsyscall.
    * sysdeps/unix/sysv/linux/x86_64/gettimeofday.S [SHARED]: Use
    __vdso_gettimeofday instead of vsyscall.

We could play really evil games to speed it up a bit. For example, I
think it's OK for int 0xcc to clobber rcx and r11, enabling this
abomination:

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e13329d..6edbde0 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1111,8 +1111,25 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
 zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
-zeroentry emulate_vsyscall do_emulate_vsyscall
+ENTRY(emulate_vsyscall)
+	INTR_FRAME
+	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	pushq_cfi $-1			/* ORIG_RAX: no syscall to restart */
+	subq $ORIG_RAX-R15, %rsp
+	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	call error_entry
+	DEFAULT_FRAME 0
+	movq %rsp,%rdi			/* pt_regs pointer */
+	xorl %esi,%esi			/* no error code */
+	call do_emulate_vsyscall
+	movq %rax,RAX(%rsp)
+	movq RSP(%rsp),%rcx
+	movq %rcx,PER_CPU_VAR(old_rsp)
+	RESTORE_REST
+	jmp ret_from_sys_call		/* XXX: should check cs */
+	CFI_ENDPROC
+END(emulate_vsyscall)

 /* Reload gs selector with exception handling */
 /* edi:  new selector */

speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.
This may be the most evil kernel patch I've ever written. But I think
it's almost correct and could be made completely correct with only a
little bit of additional effort. (I'm not really suggesting this, but
it's at least worth some entertainment.)

--Andy

P.S. Holy cow, iret is slow. Anyone want to ask their Intel / AMD
friends to add an instruction just like sysret that pops rcx and r11
from the stack or reads them from non-serialized MSRs? That way we
could do this to all of the 64-bit fast path returns.

P.P.S. It's kind of tempting to set up a little userspace trampoline
that does:

	popq %r11
	popq %rcx
	ret 128

Page faults could increment rsp by 128 (preserving the red zone), push
rip, rcx, and r11, and return via sysretq to the trampoline. This
would presumably save 80ns on Sandy Bridge :)
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-07-29 7:24 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> On Thu, Jul 28, 2011 at 9:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > Hi folks,
> >
> > It's merge window again, which means I'm doing my usual "where did
> > the XFS performance go" bisects again. The usual workload:

[...]

> > The completion time over multiple runs is ~8m10s +/-5s, and the user
> > CPU time is roughly 245s +/-5s.
> >
> > Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
> > ends up at:
> >
> > 0 48000000 0 108975.2 9507483
> > 0 48800000 0 114676.5 8604471
> > 0 49600000 0 98062.0 8921525
> > 0 50400000 0 103864.7 8218302
> > 287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
> > 0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
> >
> > Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
> > and the user CPU time is at 290s +/-5s. So the benchmark is slower to
> > complete and consumes 20% more CPU in userspace. The following commit
> > c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
> > also contributes to the slowdown a bit.
>
> I'm surprised that the second commit had any effect.

It's probably noise - the overhead numbers dropping out of fs_mark
seemed to be slightly higher, and the added user time was about
another 5s on average on top of the 50s the previous patch added.
With it taking 15-20min per bisect step, I didn't run more than one
test at each commit - the bisect took 4.5 hours as it was to find the
commit responsible. So, probably just noise.

> > FYI, fs_mark does a lot of gettimeofday() calls - one before and
> > after every syscall that does filesystem work so it can calculate
> > the syscall times and the amount of time spent not doing syscalls.
> > I'm assuming this is the problem based on the commit message.
> > Issuing hundreds of thousands of gettimeofday() calls per second spread
> > across multiple CPUs is not uncommon, especially in benchmark or
> > performance measuring software. If that is the cause, then these
> > commits add -significant- overhead to that process.
>
> I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:
>
> # test_vsyscall bench
> Benchmarking syscall gettimeofday ...   7068000 loops in 0.50004s =  70.75 nsec / loop
> Benchmarking vdso gettimeofday ...     23868000 loops in 0.50002s =  20.95 nsec / loop
> Benchmarking vsyscall gettimeofday ...  2106000 loops in 0.50004s = 237.44 nsec / loop

How does that compare to 3.0 before these changes? No point telling
me how it performs without something to compare it to, and it doesn't
tell me if gettimeofday actually slowed down or not...

> So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
> precise than gettimeofday. IMO you should fix your benchmark :)

So you're going to say that to everyone who currently uses
gettimeofday() a lot? ;)

> More seriously, though, I think it's a decent tradeoff to slow down
> some extremely vsyscall-heavy legacy workloads to remove the last bit
> of nonrandomized executable code. The only way this should show up to
> any significant extent is on modern rdtsc-using systems that make a
> huge number of vsyscalls. On older machines, even the cost of the
> trap should be smallish compared to the cost of HPET / acpi_pm access.
>
> > Assuming this is the problem, can this be fixed without requiring
> > the whole world having to wait for the current glibc dev tree to
> > filter down into distro repositories?
>
> How old is your glibc? gettimeofday has used the vdso since:

It's 2.11 on the test machine, whatever that translates to. I
haven't really changed the base userspace for about 12 months
because if I do I invalidate all my historical benchmark results
that I use for comparisons.

If I have to upgrade it to something more recent (I note that the
current libc6 is 2.13 in debian unstable) then I will, but there's
going to be plenty of people that see this if 2.11 is not recent
enough....

> speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.

I've still got nothing to compare that against... :/

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 12:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>> > FYI, fs_mark does a lot of gettimeofday() calls - one before and
>> > after every syscall that does filesystem work so it can calculate
>> > the syscall times and the amount of time spent not doing syscalls.
>> > I'm assuming this is the problem based on the commit message.
>> > Issuing hundreds of thousands of gettimeofday() calls per second spread
>> > across multiple CPUs is not uncommon, especially in benchmark or
>> > performance measuring software. If that is the cause, then these
>> > commits add -significant- overhead to that process.
>>
>> I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:
>>
>> # test_vsyscall bench
>> Benchmarking syscall gettimeofday ...   7068000 loops in 0.50004s =  70.75 nsec / loop
>> Benchmarking vdso gettimeofday ...     23868000 loops in 0.50002s =  20.95 nsec / loop
>> Benchmarking vsyscall gettimeofday ...  2106000 loops in 0.50004s = 237.44 nsec / loop
>
> How does that compare to 3.0 before these changes? No point telling
> me how it performs without something to compare it to and it doesn't
> tell me if gettimeofday actually slowed down or not...

3.0 would have identical syscall performance and very nearly identical
vdso performance (the code is identical but there could be slightly
different icache behavior). 3.0's vsyscall would have taken ~22 ns on
this hardware.

>> So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
>> precise than gettimeofday. IMO you should fix your benchmark :)
>
> So you're going to say that to everyone who currently uses
> gettimeofday() a lot? ;)

Actually, you're the first one to notice. I'm hopeful that no
non-benchmark workloads will see a significant effect.

>> More seriously, though, I think it's a decent tradeoff to slow down
>> some extremely vsyscall-heavy legacy workloads to remove the last bit
>> of nonrandomized executable code. The only way this should show up to
>> any significant extent is on modern rdtsc-using systems that make a
>> huge number of vsyscalls. On older machines, even the cost of the
>> trap should be smallish compared to the cost of HPET / acpi_pm access.
>>
>> > Assuming this is the problem, can this be fixed without requiring
>> > the whole world having to wait for the current glibc dev tree to
>> > filter down into distro repositories?
>>
>> How old is your glibc? gettimeofday has used the vdso since:
>
> It's 2.11 on the test machine, whatever that translates to. I
> haven't really changed the base userspace for about 12 months
> because if I do I invalidate all my historical benchmark results
> that I use for comparisons.

2.11 is from 2009 and appears to contain that commit. Does your
workload call time() very frequently? That's the largest slowdown.
With the old code, time() took 4-5 ns, and with the new code time() is
about as slow as gettimeofday(). I suggested having a config option
to allow time() to stay fast until glibc 2.14 became widespread, but a
few other people disagreed.

> If I have to upgrade it to something more recent (I note that the
> current libc6 is 2.13 in debian unstable) then I will but there's
> going to be plenty of people that see this if 2.11 is not recent
> enough....

If it's time(), that won't help.

>> speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.
>
> I've still got nothing to compare that against... :/

~22 ns before the changes. Note that this is only on Sandy Bridge.
The overhead of syscalls and traps is much higher on Nehalem hardware,
and I haven't done much testing on other machines. On Nehalem with
HPET on 3.1-ish code, it looks like:

Benchmarking syscall gettimeofday ...    612000 loops in 0.50076s =  818.23 nsec / loop
Benchmarking vdso gettimeofday ...       832000 loops in 0.50032s =  601.34 nsec / loop
Benchmarking vsyscall gettimeofday ...   457000 loops in 0.50056s = 1095.32 nsec / loop

With acpi_pm, it's:

Benchmarking syscall gettimeofday ...    377000 loops in 0.50007s = 1326.44 nsec / loop
Benchmarking vdso gettimeofday ...       377000 loops in 0.50112s = 1329.24 nsec / loop
Benchmarking vsyscall gettimeofday ...   316000 loops in 0.50036s = 1583.42 nsec / loop

The difference is almost gone because acpi_pm issues a syscall or trap
no matter how you issue the gettimeofday call.

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 13:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>>> > Assuming this is the problem, can this be fixed without requiring
>>> > the whole world having to wait for the current glibc dev tree to
>>> > filter down into distro repositories?
>>>
>>> How old is your glibc? gettimeofday has used the vdso since:
>>
>> It's 2.11 on the test machine, whatever that translates to. I
>> haven't really changed the base userspace for about 12 months
>> because if I do I invalidate all my historical benchmark results
>> that I use for comparisons.
>
> 2.11 is from 2009 and appears to contain that commit. Does your
> workload call time() very frequently? That's the largest slowdown.
> With the old code, time() took 4-5 ns and with the new code time() is
> about as slow as gettimeofday(). I suggested having a config option
> to allow time() to stay fast until glibc 2.14 became widespread, but a
> few other people disagreed.

*sigh*

fs_mark: fs_mark.o lib_timing.o
	${CC} -static -o fs_mark fs_mark.o lib_timing.o

Even brand-new glibc still issues vsyscalls when statically linked,
and Ulrich has said [1] that he doesn't care that much about
performance of statically linked code.

How bad would it be to just remove the -static from the makefile?

[1] http://sourceware.org/bugzilla/show_bug.cgi?id=12813

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-07-31 11:01 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> >>> > Assuming this is the problem, can this be fixed without requiring
> >>> > the whole world having to wait for the current glibc dev tree to
> >>> > filter down into distro repositories?
> >>>
> >>> How old is your glibc? gettimeofday has used the vdso since:
> >>
> >> It's 2.11 on the test machine, whatever that translates to. I
> >> haven't really changed the base userspace for about 12 months
> >> because if I do I invalidate all my historical benchmark results
> >> that I use for comparisons.
> >
> > 2.11 is from 2009 and appears to contain that commit. Does your
> > workload call time() very frequently? That's the largest slowdown.
> > With the old code, time() took 4-5 ns and with the new code time() is
> > about as slow as gettimeofday(). I suggested having a config option
> > to allow time() to stay fast until glibc 2.14 became widespread, but a
> > few other people disagreed.
>
> *sigh*
>
> fs_mark: fs_mark.o lib_timing.o
> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
>
> Even brand-new glibc still issues vsyscalls when statically linked,
> and Ulrich has said [1] that he doesn't care that much about
> performance of statically linked code.
>
> How bad would it be to just remove the -static from the makefile?

Results in 270s +-5s user CPU time, so user CPU time is still ~10%
up on the 3.0 numbers. IOWs, a non-static link roughly halves the
regression but doesn't get rid of it.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-08-01 12:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Sun, Jul 31, 2011 at 7:01 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
>> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>> >>> > Assuming this is the problem, can this be fixed without requiring
>> >>> > the whole world having to wait for the current glibc dev tree to
>> >>> > filter down into distro repositories?
>> >>>
>> >>> How old is your glibc? gettimeofday has used the vdso since:
>> >>
>> >> It's 2.11 on the test machine, whatever that translates to. I
>> >> haven't really changed the base userspace for about 12 months
>> >> because if I do I invalidate all my historical benchmark results
>> >> that I use for comparisons.
>> >
>> > 2.11 is from 2009 and appears to contain that commit. Does your
>> > workload call time() very frequently? That's the largest slowdown.
>> > With the old code, time() took 4-5 ns and with the new code time() is
>> > about as slow as gettimeofday(). I suggested having a config option
>> > to allow time() to stay fast until glibc 2.14 became widespread, but a
>> > few other people disagreed.
>>
>> *sigh*
>>
>> fs_mark: fs_mark.o lib_timing.o
>> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
>>
>> Even brand-new glibc still issues vsyscalls when statically linked,
>> and Ulrich has said [1] that he doesn't care that much about
>> performance of statically linked code.
>>
>> How bad would it be to just remove the -static from the makefile?
>
> Results in 270s +-5s user CPU time, so user CPU time is still ~10%
> up on 3.0 numbers. IOWs, a non-static link roughly halves the
> regression but doesn't get rid of it.

Are you sure? I stuck a trace event in do_emulate_vsyscall and it's
not getting hit at all in fs_mark, at least on my system. I'll send
out the patch tomorrow.

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-08-01 13:25 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Mon, Aug 01, 2011 at 08:29:47AM -0400, Andrew Lutomirski wrote:
> On Sun, Jul 31, 2011 at 7:01 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
> >> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> >> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> >> >>> > Assuming this is the problem, can this be fixed without requiring
> >> >>> > the whole world having to wait for the current glibc dev tree to
> >> >>> > filter down into distro repositories?
> >> >>>
> >> >>> How old is your glibc? gettimeofday has used the vdso since:
> >> >>
> >> >> It's 2.11 on the test machine, whatever that translates to. I
> >> >> haven't really changed the base userspace for about 12 months
> >> >> because if I do I invalidate all my historical benchmark results
> >> >> that I use for comparisons.
> >> >
> >> > 2.11 is from 2009 and appears to contain that commit. Does your
> >> > workload call time() very frequently? That's the largest slowdown.
> >> > With the old code, time() took 4-5 ns and with the new code time() is
> >> > about as slow as gettimeofday(). I suggested having a config option
> >> > to allow time() to stay fast until glibc 2.14 became widespread, but a
> >> > few other people disagreed.
> >>
> >> *sigh*
> >>
> >> fs_mark: fs_mark.o lib_timing.o
> >> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
> >>
> >> Even brand-new glibc still issues vsyscalls when statically linked,
> >> and Ulrich has said [1] that he doesn't care that much about
> >> performance of statically linked code.
> >>
> >> How bad would it be to just remove the -static from the makefile?
> >
> > Results in 270s +-5s user CPU time, so user CPU time is still ~10%
> > up on 3.0 numbers. IOWs, a non-static link roughly halves the
> > regression but doesn't get rid of it.
>
> Are you sure? I stuck a trace event in do_emulate_vsyscall and it's
> not getting hit at all in fs_mark, at least on my system. I'll send
> out the patch tomorrow.

It may be other changes to kernel code that are causing the rest of
the regression. Kernel code that blows the CPU caches (e.g. direct
reclaim LRU scanning) can have a major effect on the userspace CPU
time, so it's probably some secondary effect like that I'm seeing.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2011-08-01 13:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-29  1:38 [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20% Dave Chinner
2011-07-29  3:30 ` Andrew Lutomirski
2011-07-29  7:24   ` Dave Chinner
2011-07-29 12:17     ` Andrew Lutomirski
2011-07-29 13:26       ` Andrew Lutomirski
2011-07-31 11:01         ` Dave Chinner
2011-08-01 12:29           ` Andrew Lutomirski
2011-08-01 13:25             ` Dave Chinner