Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Shaohua Li <shli@fb.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	X86 ML <x86@kernel.org>, <kernel-team@fb.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO
Date: Wed, 10 Dec 2014 13:57:13 -0800	[thread overview]
Message-ID: <20141210215713.GA30230@devbig257.prn2.facebook.com> (raw)
In-Reply-To: <CALCETrUU-vb34550mbE-yqoWYrzVNarVf++Logt0gmSTh_Cg2w@mail.gmail.com>

On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <shli@fb.com> wrote:
> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> > use the following method to compute the thread cpu time:
> 
> I like the idea, and I like making this type of profiling fast.  I
> don't love the implementation because it's an information leak (maybe
> we don't care) and it's ugly.
> 
> The info leak could be fixed completely by having a per-process array
> instead of a global array.  That's currently tricky without wasting
> memory, but it could be created on demand if we wanted to do that,
> once my vvar .fault patches go in (assuming they do -- I need to ping
> the linux-mm people).

those info leak really doesn't matter. But we need the global array
anyway. The context switch detection should be per-cpu data and should
be able to access in remote cpus.

> As for ugliness, it seems to me that there really ought to be a better
> way to do this.  What we really want is a per-thread vvar-like thing.
> Any code that can use TLS can do this trivially, but the vdso,
> unfortunately, has no straightforward way to use TLS.
> 
> If we could rely on something like TSX (ha!), then this becomes straightforward:
> 
> xbegin()
> rdtscp
> read params
> xcommit()
> 
> But we can't do that for the next decade or so, obviously.
> 
> If we were willing to limit ourselves to systems with rdwrgsbase, then
> we could do:
> 
> save gs and gsbase
> mov $__VDSO_CPU,%gs
> mov %gs:(base our array),%whatever
> restore gs and gsbase
> 
> (We actually could do this if we make arch_prctl disable this feature
> the first time a process tries to set the gs base.)
> 
> If we were willing to limit ourselves to at most 2^20 threads per
> process, then we could assign each thread a 20-bit index within its
> process with some LSL trickery.  Then we could have the vdso look up
> the thread in a per-process array with one element per thread.  This
> would IMO be the winning approach if we think we'd ever extend the set
> of per-thread vvar things.  (For the 32-bit VDSO, this is trivial --
> just use a far pointer.)
> 
> If we has per-cpu fixmap entries, we'd use that.  Unfortunately, that
> would be a giant mess to implement, and it would be a slowdown for
> some workloads (and a speedup for others - hmm).
> 
> Grumble.  Thoughts?
> 
> --Andy
> 
> >
> >     t0 = process start
> >     t1 = most recent context switch time
> >     t2 = time at which the vsyscall is invoked
> >
> >     thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> >                 = current->se.sum_exec_runtime + now - sched_clock()
> >
> > At context switch time We stash away
> >
> >     adj_sched_time = sum_exec_runtime - sched_clock()
> >
> > in a per-cpu struct in the VVAR page and then compute
> >
> >     thread_cpu_time = adj_sched_time + now
> >
> > All computations are done in nanosecs on systems where TSC is stable. If
> > TSC is unstable, we fallback to a regular syscall.
> >     Benchmark data:
> >
> >     for (i = 0; i < 100000000; i++) {
> >             clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> >             sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> >     }
> >
> >     Baseline:
> >     real    1m3.428s
> >     user    0m5.190s
> >     sys     0m58.220s
> >
> >     patched:
> >     real    0m4.912s
> >     user    0m4.910s
> >     sys     0m0.000s
> >
> > This should speed up profilers that need to query thread cpu time a lot
> > to do fine-grained timestamps.
> >
> > No statistically significant regression was detected on x86_64 context
> > switch code. Most archs that don't support vsyscalls will have this code
> > disabled via jump labels.
> >
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Signed-off-by: Kumar Sundararajan <kumar@fb.com>
> > Signed-off-by: Arun Sharma <asharma@fb.com>
> > Signed-off-by: Chris Mason <clm@fb.com>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  arch/x86/include/asm/vdso.h    |  9 +++++++-
> >  arch/x86/kernel/tsc.c          | 14 ++++++++++++
> >  arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> >  arch/x86/vdso/vma.c            |  4 ++++
> >  kernel/sched/core.c            |  3 +++
> >  5 files changed, 78 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> > index 42d6d2c..cbbab3a 100644
> > --- a/arch/x86/include/asm/vdso.h
> > +++ b/arch/x86/include/asm/vdso.h
> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> >  struct vdso_percpu_data {
> >         /* layout: | cpu ID | context switch count | */
> >         unsigned long cpu_cs_count;
> > +
> > +       unsigned long long adj_sched_time;
> > +       unsigned long long cyc2ns_offset;
> > +       unsigned long cyc2ns;
> 
> Having two offsets seems unnecessary.
> 
> >  } ____cacheline_aligned;
> >
> >  struct vdso_data {
> > -       int dummy;
> > +       unsigned int thread_cputime_disabled;
> >         struct vdso_percpu_data vpercpu[0];
> >  };
> >  extern struct vdso_data vdso_data;
> >
> > +struct static_key;
> > +extern struct static_key vcpu_thread_cputime_enabled;
> > +
> >  #ifdef CONFIG_SMP
> >  #define VDSO_CS_COUNT_BITS \
> >         (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index b7e50bb..83f8091 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -21,6 +21,7 @@
> >  #include <asm/hypervisor.h>
> >  #include <asm/nmi.h>
> >  #include <asm/x86_init.h>
> > +#include <asm/vdso.h>
> >
> >  unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
> >  EXPORT_SYMBOL(cpu_khz);
> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> >         data->cyc2ns_offset = ns_now -
> >                 mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
> >
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > +       vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> > +       vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> > +#endif
> > +
> >         cyc2ns_write_end(cpu, data);
> >
> >  done:
> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> >                 tsc_unstable = 1;
> >                 clear_sched_clock_stable();
> >                 disable_sched_clock_irqtime();
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > +               vdso_data.thread_cputime_disabled = 1;
> > +               static_key_slow_dec(&vcpu_thread_cputime_enabled);
> > +#endif
> >                 pr_info("Marking TSC unstable due to %s\n", reason);
> >                 /* Change only the rating, when not registered */
> >                 if (clocksource_tsc.mult)
> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
> >
> >         tsc_disabled = 0;
> >         static_key_slow_inc(&__use_tsc);
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > +       vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> > +                       X86_FEATURE_RDTSCP);
> > +#endif
> 
> This is backwards IMO.  Even if this function never runs, the flag
> should be set to disable the feature.  Then you can enable it here.
> 
> >
> >         if (!no_sched_irq_time)
> >                 enable_sched_clock_irqtime();
> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> > index 438b3be..46c03af2 100644
> > --- a/arch/x86/vdso/vclock_gettime.c
> > +++ b/arch/x86/vdso/vclock_gettime.c
> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> >         smp_rmb();
> >         return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> >  }
> > +
> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> > +                                       unsigned long long scale,
> > +                                       unsigned long long offset)
> > +{
> > +       unsigned long long ns = offset;
> > +       ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> > +       return ns;
> > +}
> > +
> > +notrace static unsigned long do_thread_cpu_time(void)
> > +{
> > +       unsigned int p;
> > +       u_int64_t tscval;
> > +       unsigned long long adj_sched_time;
> > +       unsigned long long scale;
> > +       unsigned long long offset;
> > +       const struct vdso_data *vp = &VVAR(vdso_data);
> > +       int cpu;
> > +       unsigned long cs;
> > +
> > +       do {
> > +               cs = __vdso_get_context_switch_count();
> > +
> > +               rdtscpll(tscval, p);
> > +               cpu = p & VGETCPU_CPU_MASK;
> > +               adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> > +               scale = vp->vpercpu[cpu].cyc2ns;
> > +               offset = vp->vpercpu[cpu].cyc2ns_offset;
> > +
> > +       } while (unlikely(cs != __vdso_get_context_switch_count()));
> > +
> > +       return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
> 
> You're literally adding two offsets together here, although it's
> obscured by the abstraction of __cycles_2_ns.

I don't understand 'two offsets' stuff, could you please give more
details?

> Also, if you actually expand this function out, it's not optimized
> well.  You're doing, roughly:
> 
> getcpu
> read cs count
> rdtscp
> read clock params
> getcpu
> read cs count
> repeat if necessary
> 
> You should need at most two operations that read the cpu number.  You could do:
> 
> getcpu
> read cs count
> rdtscp
> read clock params
> barrier()
> check that both cpu numbers agree and that the cs count hasn't changed
> 
> All those smp barriers should be unnecessary, too.  There's no SMP here.
> 
> Also, if you do this, then the high bits of the cs count can be removed.
> It may also make sense to bail and use the fallback after a couple
> failures.  Otherwise, in some debuggers, you will literally never
> finish the loop.

ok, makes sense.

> I am curious, though: why are you using sched clock instead of
> CLOCK_MONOTONIC?  Is it because you can't read CLOCK_MONOTONIC in the
> scheduler?

The sched clock is faster, I guess.

Thanks,
Shaohua

next prev parent reply	other threads:[~2014-12-10 21:57 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-08  3:03 [PATCH 1/3] X86: make VDSO data support multiple pages Shaohua Li
2014-12-08  3:03 ` [PATCH 2/3] X86: add a generic API to let vdso code detect context switch Shaohua Li
2014-12-10 18:38   ` Andy Lutomirski
2014-12-10 18:51     ` Shaohua Li
2014-12-10 19:11       ` Andy Lutomirski
2014-12-10 19:41         ` Shaohua Li
2014-12-08  3:03 ` [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO Shaohua Li
2014-12-10 19:10   ` Andy Lutomirski
2014-12-10 21:57     ` Shaohua Li [this message]
2014-12-10 22:13       ` Andy Lutomirski
2014-12-10 22:56         ` Shaohua Li
2014-12-10 23:06           ` Andy Lutomirski
2014-12-11  6:36             ` Ingo Molnar
2014-12-15 18:36               ` Chris Mason
2014-12-15 18:55                 ` Andy Lutomirski
2014-12-10 18:36 ` [PATCH 1/3] X86: make VDSO data support multiple pages Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141210215713.GA30230@devbig257.prn2.facebook.com \
    --to=shli@fb.com \
    --cc=hpa@zytor.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.