* [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
@ 2011-07-29 1:38 Dave Chinner
2011-07-29 3:30 ` Andrew Lutomirski
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2011-07-29 1:38 UTC (permalink / raw)
To: linux-kernel; +Cc: luto, mingo
Hi folks,
It's merge window again, which means I'm doing my usual "where did
the XFS performance go" bisects again. The usual workload:
$ time sudo mkfs.xfs -f -d agcount=32 -l size=131072b /dev/vda
meta-data=/dev/vda               isize=256    agcount=32, agsize=134217728 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=4294967296, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
real 0m4.496s
user 0m0.012s
sys 0m0.196s
$ sudo mount -o nobarrier,logbsize=262144,inode64 /dev/vda /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd /home/dave/src/fs_mark-3.3/
$ /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
On a 8p 4GB RAM VM and a 16TB filesystem. On 3.0, this completes in:
0 48000000 0 110068.2 7165149
0 48800000 0 112275.1 7523326
0 49600000 0 111725.3 9002077
0 50400000 0 109074.4 7111665
246.67user 2903.42system 8:16.96elapsed 633%CPU (0avgtext+0avgdata 82544maxresident)k
0inputs+0outputs (1639major+2603202minor)pagefaults 0swaps
$
The completion time over multiple runs is ~8m10s +/-5s, and the user
CPU time is roughly 245s +/-5s.
Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
ends up at:
0 48000000 0 108975.2 9507483
0 48800000 0 114676.5 8604471
0 49600000 0 98062.0 8921525
0 50400000 0 103864.7 8218302
287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
and the user CPU time is at 290s +/-5s. So the benchmark is slower to
complete and consumes 20% more CPU in userspace. The following commit
c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
also contributes to the slowdown a bit.
FYI, fs_mark does a lot of gettimeofday() calls - one before and
after every syscall that does filesystem work so it can calculate
the syscall times and the amount of time spent not doing syscalls.
I'm assuming this is the problem based on the commit message.
Issuing hundreds of thousands of gettimeofday() calls per second spread
across multiple CPUs is not uncommon, especially in benchmark or
performance measuring software. If that is the cause, then these
commits add -significant- overhead to that process.
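The timing pattern described above boils down to something like the
following (a hypothetical sketch for illustration only - `tv_delta_us`
and `time_one_op` are made-up names, not fs_mark's actual code):

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two timevals. */
long tv_delta_us(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L
         + (end->tv_usec - start->tv_usec);
}

/* Bracket a single operation with gettimeofday(): two timer calls per
 * measured syscall, so any per-call gettimeofday() overhead shows up
 * directly as user CPU time when this runs hundreds of thousands of
 * times per second. */
long time_one_op(void (*op)(void))
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    op();                       /* the filesystem work being measured */
    gettimeofday(&after, NULL);
    return tv_delta_us(&before, &after);
}
```

With this shape, a regression in gettimeofday() itself is charged to every
single measured operation, which is why a slower vsyscall path shows up so
prominently in the benchmark's user time.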
Assuming this is the problem, can this be fixed without the whole
world having to wait for the current glibc dev tree to filter down
into distro repositories?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 3:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Thu, Jul 28, 2011 at 9:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> Hi folks,
>
> It's merge window again, which means I'm doing my usual "where did
> the XFS performance go" bisects again. The usual workload:

[...]

> The completion time over multiple runs is ~8m10s +/-5s, and the user
> CPU time is roughly 245s +/-5s.
>
> Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
> ends up at:
>
> 0 48000000 0 108975.2 9507483
> 0 48800000 0 114676.5 8604471
> 0 49600000 0 98062.0 8921525
> 0 50400000 0 103864.7 8218302
> 287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
> 0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
>
> Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
> and the user CPU time is at 290s +/-5s. So the benchmark is slower to
> complete and consumes 20% more CPU in userspace. The following commit
> c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
> also contributes to the slowdown a bit.

I'm surprised that the second commit had any effect.

> FYI, fs_mark does a lot of gettimeofday() calls - one before and
> after every syscall that does filesystem work so it can calculate
> the syscall times and the amount of time spent not doing syscalls.
> I'm assuming this is the problem based on the commit message.
> Issuing hundreds of thousands of gettimeofday() calls per second spread
> across multiple CPUs is not uncommon, especially in benchmark or
> performance measuring software. If that is the cause, then these
> commits add -significant- overhead to that process.

I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:

# test_vsyscall bench
Benchmarking syscall gettimeofday ...     7068000 loops in 0.50004s =  70.75 nsec / loop
Benchmarking vdso gettimeofday ...       23868000 loops in 0.50002s =  20.95 nsec / loop
Benchmarking vsyscall gettimeofday ...    2106000 loops in 0.50004s = 237.44 nsec / loop
Benchmarking syscall CLOCK_MONOTONIC ...  9018000 loops in 0.50002s =  55.45 nsec / loop
Benchmarking vdso CLOCK_MONOTONIC ...    30867000 loops in 0.50002s =  16.20 nsec / loop
Benchmarking syscall time ...            12962000 loops in 0.50001s =  38.58 nsec / loop
Benchmarking vdso time ...              286269000 loops in 0.50000s =   1.75 nsec / loop
Benchmarking vsyscall time ...            2412000 loops in 0.50012s = 207.35 nsec / loop
Benchmarking vdso getcpu ...             40265000 loops in 0.50001s =  12.42 nsec / loop
Benchmarking vsyscall getcpu ...          2334000 loops in 0.50012s = 214.27 nsec / loop
Benchmarking dummy syscall ...           14927000 loops in 0.50000s =  33.50 nsec / loop

So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
precise than gettimeofday. IMO you should fix your benchmark :)

More seriously, though, I think it's a decent tradeoff to slow down
some extremely vsyscall-heavy legacy workloads to remove the last bit
of nonrandomized executable code. The only way this should show up to
any significant extent is on modern rdtsc-using systems that make a
huge number of vsyscalls. On older machines, even the cost of the
trap should be smallish compared to the cost of HPET / acpi_pm access.

> Assuming this is the problem, can this be fixed without requiring
> the whole world having to wait for the current glibc dev tree to
> filter down into distro repositories?

How old is your glibc? gettimeofday has used the vdso since:

commit 9c6f6953fda96b49c8510a879304ea4222ea1781
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Mon Aug 13 18:47:42 2007 +0000

    * sysdeps/unix/sysv/linux/x86_64/libc-start.c
    (_libc_vdso_platform_setup): If vDSO is not available point
    __vdso_gettimeofday to the vsyscall.
    * sysdeps/unix/sysv/linux/x86_64/gettimeofday.S [SHARED]: Use
    __vdso_gettimeofday instead of vsyscall.

We could play really evil games to speed it up a bit. For example, I
think it's OK for int 0xcc to clobber rcx and r11, enabling this
abomination:

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e13329d..6edbde0 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1111,8 +1111,25 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
 zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
-zeroentry emulate_vsyscall do_emulate_vsyscall
+ENTRY(emulate_vsyscall)
+	INTR_FRAME
+	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	pushq_cfi $-1			/* ORIG_RAX: no syscall to restart */
+	subq $ORIG_RAX-R15, %rsp
+	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	call error_entry
+	DEFAULT_FRAME 0
+	movq %rsp,%rdi			/* pt_regs pointer */
+	xorl %esi,%esi			/* no error code */
+	call do_emulate_vsyscall
+	movq %rax,RAX(%rsp)
+	movq RSP(%rsp),%rcx
+	movq %rcx,PER_CPU_VAR(old_rsp)
+	RESTORE_REST
+	jmp ret_from_sys_call		/* XXX: should check cs */
+	CFI_ENDPROC
+END(emulate_vsyscall)

 /* Reload gs selector with exception handling */
 /* edi:  new selector */

speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.
This may be the most evil kernel patch I've ever written. But I think
it's almost correct and could be made completely correct with only a
little bit of additional effort. (I'm not really suggesting this, but
it's at least worth some entertainment.)

--Andy

P.S. Holy cow, iret is slow. Anyone want to ask their Intel / AMD
friends to add an instruction just like sysret that pops rcx and r11
from the stack or reads them from non-serialized MSRs? That way we
could do this to all of the 64-bit fast path returns.

P.P.S. It's kind of tempting to set up a little userspace trampoline
that does:

	popq %r11
	popq %rcx
	ret 128

Page faults could increment rsp by 128 (preserving the red zone), push
rip, rcx, and r11, and return via sysretq to the trampoline. This
would presumably save 80ns on Sandy Bridge :)
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-07-29 7:24 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> On Thu, Jul 28, 2011 at 9:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > Hi folks,
> >
> > It's merge window again, which means I'm doing my usual "where did
> > the XFS performance go" bisects again. The usual workload:

[...]

> > The completion time over multiple runs is ~8m10s +/-5s, and the user
> > CPU time is roughly 245s +/-5s.
> >
> > Enter 5cec93c2 ("x86-64: Emulate legacy vsyscalls") and the result
> > ends up at:
> >
> > 0 48000000 0 108975.2 9507483
> > 0 48800000 0 114676.5 8604471
> > 0 49600000 0 98062.0 8921525
> > 0 50400000 0 103864.7 8218302
> > 287.35user 2933.90system 8:33.11elapsed 627%CPU (0avgtext+0avgdata 82560maxresident)k
> > 0inputs+0outputs (1664major+2603457minor)pagefaults 0swaps
> >
> > Noticeably slower wall time with more variance - it's at 8m30s +/-10s,
> > and the user CPU time is at 290s +/-5s. So the benchmark is slower to
> > complete and consumes 20% more CPU in userspace. The following commit
> > c971294 ("x86-64: Improve vsyscall emulation CS and RIP handling")
> > also contributes to the slowdown a bit.
>
> I'm surprised that the second commit had any effect.

It's probably noise - the overhead numbers dropping out of fs_mark
seemed to be slightly higher, and the added user time was about
another 5s on average on top of the 50s the previous patch added.
With it taking 15-20min per bisect step, I didn't run more than one
test at each commit - the bisect took 4.5 hours as it was to find the
commit responsible. So, probably just noise.

> > FYI, fs_mark does a lot of gettimeofday() calls - one before and
> > after every syscall that does filesystem work so it can calculate
> > the syscall times and the amount of time spent not doing syscalls.
> > I'm assuming this is the problem based on the commit message.
> > Issuing hundreds of thousands of gettimeofday() calls per second spread
> > across multiple CPUs is not uncommon, especially in benchmark or
> > performance measuring software. If that is the cause, then these
> > commits add -significant- overhead to that process.
>
> I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:
>
> # test_vsyscall bench
> Benchmarking syscall gettimeofday ...   7068000 loops in 0.50004s =  70.75 nsec / loop
> Benchmarking vdso gettimeofday ...     23868000 loops in 0.50002s =  20.95 nsec / loop
> Benchmarking vsyscall gettimeofday ...  2106000 loops in 0.50004s = 237.44 nsec / loop

How does that compare to 3.0 before these changes? No point telling
me how it performs without something to compare it to, and it doesn't
tell me if gettimeofday actually slowed down or not...

> So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
> precise than gettimeofday. IMO you should fix your benchmark :)

So you're going to say that to everyone who currently uses
gettimeofday() a lot? ;)

> More seriously, though, I think it's a decent tradeoff to slow down
> some extremely vsyscall-heavy legacy workloads to remove the last bit
> of nonrandomized executable code. The only way this should show up to
> any significant extent is on modern rdtsc-using systems that make a
> huge number of vsyscalls. On older machines, even the cost of the
> trap should be smallish compared to the cost of HPET / acpi_pm access.
>
> > Assuming this is the problem, can this be fixed without requiring
> > the whole world having to wait for the current glibc dev tree to
> > filter down into distro repositories?
>
> How old is your glibc? gettimeofday has used the vdso since:

It's 2.11 on the test machine, whatever that translates to. I
haven't really changed the base userspace for about 12 months
because if I do I invalidate all my historical benchmark results
that I use for comparisons.

If I have to upgrade it to something more recent (I note that the
current libc6 is 2.13 in debian unstable) then I will, but there's
going to be plenty of people that see this if 2.11 is not recent
enough....

> speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.

I've still got nothing to compare that against... :/

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 12:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>> > FYI, fs_mark does a lot of gettimeofday() calls - one before and
>> > after every syscall that does filesystem work so it can calculate
>> > the syscall times and the amount of time spent not doing syscalls.
>> > I'm assuming this is the problem based on the commit message.
>> > Issuing hundreds of thousands of gettimeofday() calls per second spread
>> > across multiple CPUs is not uncommon, especially in benchmark or
>> > performance measuring software. If that is the cause, then these
>> > commits add -significant- overhead to that process.
>>
>> I put some work into speeding up vdso timing in 3.0. As of Linus' tree now:
>>
>> # test_vsyscall bench
>> Benchmarking syscall gettimeofday ...   7068000 loops in 0.50004s =  70.75 nsec / loop
>> Benchmarking vdso gettimeofday ...     23868000 loops in 0.50002s =  20.95 nsec / loop
>> Benchmarking vsyscall gettimeofday ...  2106000 loops in 0.50004s = 237.44 nsec / loop
>
> How does that compare to 3.0 before these changes? No point telling
> me how it performs without something to compare it to and it doesn't
> tell me if gettimeofday actually slowed down or not...

3.0 would have identical syscall performance and very nearly identical
vdso performance (the code is identical but there could be slightly
different icache behavior). 3.0's vsyscall would have taken ~22 ns on
this hardware.

>> So clock_gettime(CLOCK_MONOTONIC) is faster, more correct, and more
>> precise than gettimeofday. IMO you should fix your benchmark :)
>
> So you're going to say that to everyone who currently uses
> gettimeofday() a lot? ;)

Actually, you're the first one to notice. I'm hopeful that no
non-benchmark workloads will see a significant effect.

>> More seriously, though, I think it's a decent tradeoff to slow down
>> some extremely vsyscall-heavy legacy workloads to remove the last bit
>> of nonrandomized executable code. The only way this should show up to
>> any significant extent is on modern rdtsc-using systems that make a
>> huge number of vsyscalls. On older machines, even the cost of the
>> trap should be smallish compared to the cost of HPET / acpi_pm access.
>>
>> > Assuming this is the problem, can this be fixed without requiring
>> > the whole world having to wait for the current glibc dev tree to
>> > filter down into distro repositories?
>>
>> How old is your glibc? gettimeofday has used the vdso since:
>
> It's 2.11 on the test machine, whatever that translates to. I
> haven't really changed the base userspace for about 12 months
> because if I do I invalidate all my historical benchmark results
> that I use for comparisons.

2.11 is from 2009 and appears to contain that commit. Does your
workload call time() very frequently? That's the largest slowdown.
With the old code, time() took 4-5 ns, and with the new code time() is
about as slow as gettimeofday(). I suggested having a config option
to allow time() to stay fast until glibc 2.14 became widespread, but a
few other people disagreed.

> If I have to upgrade it to something more recent (I note that the
> current libc6 is 2.13 in debian unstable) then I will but there's
> going to be plenty of people that see this if 2.11 is not recent
> enough....

If it's time(), that won't help.

>> speeds up the gettimeofday emulated vsyscall from 237 ns to 157 ns.
>
> I've still got nothing to compare that against... :/

~22 ns before the changes. Note that this is only on Sandy Bridge.
The overhead of syscalls and traps is much higher on Nehalem hardware,
and I haven't done much testing on other machines. On Nehalem with
HPET on 3.1-ish code, it looks like:

Benchmarking syscall gettimeofday ...    612000 loops in 0.50076s =  818.23 nsec / loop
Benchmarking vdso gettimeofday ...       832000 loops in 0.50032s =  601.34 nsec / loop
Benchmarking vsyscall gettimeofday ...   457000 loops in 0.50056s = 1095.32 nsec / loop

With acpi_pm, it's:

Benchmarking syscall gettimeofday ...    377000 loops in 0.50007s = 1326.44 nsec / loop
Benchmarking vdso gettimeofday ...       377000 loops in 0.50112s = 1329.24 nsec / loop
Benchmarking vsyscall gettimeofday ...   316000 loops in 0.50036s = 1583.42 nsec / loop

The difference is almost gone because acpi_pm issues a syscall or trap
no matter how you issue the gettimeofday call.

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-07-29 13:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>>> > Assuming this is the problem, can this be fixed without requiring
>>> > the whole world having to wait for the current glibc dev tree to
>>> > filter down into distro repositories?
>>>
>>> How old is your glibc? gettimeofday has used the vdso since:
>>
>> It's 2.11 on the test machine, whatever that translates to. I
>> haven't really changed the base userspace for about 12 months
>> because if I do I invalidate all my historical benchmark results
>> that I use for comparisons.
>
> 2.11 is from 2009 and appears to contain that commit. Does your
> workload call time() very frequently? That's the largest slowdown.
> With the old code, time() took 4-5 ns and with the new code time() is
> about as slow as gettimeofday(). I suggested having a config option
> to allow time() to stay fast until glibc 2.14 became widespread, but a
> few other people disagreed.

*sigh*

fs_mark: fs_mark.o lib_timing.o
	${CC} -static -o fs_mark fs_mark.o lib_timing.o

Even brand-new glibc still issues vsyscalls when statically linked,
and Ulrich has said [1] that he doesn't care that much about
performance of statically linked code.

How bad would it be to just remove the -static from the makefile?

[1] http://sourceware.org/bugzilla/show_bug.cgi?id=12813

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-07-31 11:01 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> >>> > Assuming this is the problem, can this be fixed without requiring
> >>> > the whole world having to wait for the current glibc dev tree to
> >>> > filter down into distro repositories?
> >>>
> >>> How old is your glibc? gettimeofday has used the vdso since:
> >>
> >> It's 2.11 on the test machine, whatever that translates to. I
> >> haven't really changed the base userspace for about 12 months
> >> because if I do I invalidate all my historical benchmark results
> >> that I use for comparisons.
> >
> > 2.11 is from 2009 and appears to contain that commit. Does your
> > workload call time() very frequently? That's the largest slowdown.
> > With the old code, time() took 4-5 ns and with the new code time() is
> > about as slow as gettimeofday(). I suggested having a config option
> > to allow time() to stay fast until glibc 2.14 became widespread, but a
> > few other people disagreed.
>
> *sigh*
>
> fs_mark: fs_mark.o lib_timing.o
> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
>
> Even brand-new glibc still issues vsyscalls when statically linked,
> and Ulrich has said [1] that he doesn't care that much about
> performance of statically linked code.
>
> How bad would it be to just remove the -static from the makefile?

Results in 270s +-5s user CPU time, so user CPU time is still ~10%
up on the 3.0 numbers. IOWs, a non-static link roughly halves the
regression but doesn't get rid of it.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Andrew Lutomirski @ 2011-08-01 12:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, mingo

On Sun, Jul 31, 2011 at 7:01 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
>> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
>> >>> > Assuming this is the problem, can this be fixed without requiring
>> >>> > the whole world having to wait for the current glibc dev tree to
>> >>> > filter down into distro repositories?
>> >>>
>> >>> How old is your glibc? gettimeofday has used the vdso since:
>> >>
>> >> It's 2.11 on the test machine, whatever that translates to. I
>> >> haven't really changed the base userspace for about 12 months
>> >> because if I do I invalidate all my historical benchmark results
>> >> that I use for comparisons.
>> >
>> > 2.11 is from 2009 and appears to contain that commit. Does your
>> > workload call time() very frequently? That's the largest slowdown.
>> > With the old code, time() took 4-5 ns and with the new code time() is
>> > about as slow as gettimeofday(). I suggested having a config option
>> > to allow time() to stay fast until glibc 2.14 became widespread, but a
>> > few other people disagreed.
>>
>> *sigh*
>>
>> fs_mark: fs_mark.o lib_timing.o
>> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
>>
>> Even brand-new glibc still issues vsyscalls when statically linked,
>> and Ulrich has said [1] that he doesn't care that much about
>> performance of statically linked code.
>>
>> How bad would it be to just remove the -static from the makefile?
>
> Results in 270s +-5s user CPU time, so user CPU time is still ~10%
> up on 3.0 numbers. IOWs, a non-static link roughly halves the
> regression but doesn't get rid of it.

Are you sure? I stuck a trace event in do_emulate_vsyscall and it's
not getting hit at all in fs_mark, at least on my system. I'll send
out the patch tomorrow.

--Andy
* Re: [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20%
From: Dave Chinner @ 2011-08-01 13:25 UTC (permalink / raw)
To: Andrew Lutomirski; +Cc: linux-kernel, mingo

On Mon, Aug 01, 2011 at 08:29:47AM -0400, Andrew Lutomirski wrote:
> On Sun, Jul 31, 2011 at 7:01 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Jul 29, 2011 at 09:26:19AM -0400, Andrew Lutomirski wrote:
> >> On Fri, Jul 29, 2011 at 8:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
> >> > On Fri, Jul 29, 2011 at 3:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> On Thu, Jul 28, 2011 at 11:30:49PM -0400, Andrew Lutomirski wrote:
> >> >>> > Assuming this is the problem, can this be fixed without requiring
> >> >>> > the whole world having to wait for the current glibc dev tree to
> >> >>> > filter down into distro repositories?
> >> >>>
> >> >>> How old is your glibc? gettimeofday has used the vdso since:
> >> >>
> >> >> It's 2.11 on the test machine, whatever that translates to. I
> >> >> haven't really changed the base userspace for about 12 months
> >> >> because if I do I invalidate all my historical benchmark results
> >> >> that I use for comparisons.
> >> >
> >> > 2.11 is from 2009 and appears to contain that commit. Does your
> >> > workload call time() very frequently? That's the largest slowdown.
> >> > With the old code, time() took 4-5 ns and with the new code time() is
> >> > about as slow as gettimeofday(). I suggested having a config option
> >> > to allow time() to stay fast until glibc 2.14 became widespread, but a
> >> > few other people disagreed.
> >>
> >> *sigh*
> >>
> >> fs_mark: fs_mark.o lib_timing.o
> >> 	${CC} -static -o fs_mark fs_mark.o lib_timing.o
> >>
> >> Even brand-new glibc still issues vsyscalls when statically linked,
> >> and Ulrich has said [1] that he doesn't care that much about
> >> performance of statically linked code.
> >>
> >> How bad would it be to just remove the -static from the makefile?
> >
> > Results in 270s +-5s user CPU time, so user CPU time is still ~10%
> > up on 3.0 numbers. IOWs, a non-static link roughly halves the
> > regression but doesn't get rid of it.
>
> Are you sure? I stuck a trace event in do_emulate_vsyscall and it's
> not getting hit at all in fs_mark, at least on my system. I'll send
> out the patch tomorrow.

It may be other changes to kernel code that are causing the rest of
the regression. Kernel code that blows the CPU caches (e.g. direct
reclaim LRU scanning) can have a major effect on the userspace CPU
time, so it's probably some secondary effect like that I'm seeing.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2011-08-01 13:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-29  1:38 [3.0-rc0 Regression]: legacy vsyscall emulation increases user CPU time by 20% Dave Chinner
2011-07-29  3:30 ` Andrew Lutomirski
2011-07-29  7:24   ` Dave Chinner
2011-07-29 12:17     ` Andrew Lutomirski
2011-07-29 13:26       ` Andrew Lutomirski
2011-07-31 11:01         ` Dave Chinner
2011-08-01 12:29           ` Andrew Lutomirski
2011-08-01 13:25             ` Dave Chinner