public inbox for linux-kernel@vger.kernel.org
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Thomas Gleixner <tglx@linutronix.de>,
	David Miller <davem@davemloft.net>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	ltt-dev@lists.casi.polymtl.ca
Subject: Re: [RFC patch 15/15] LTTng timestamp x86
Date: Fri, 17 Oct 2008 16:23:13 -0400	[thread overview]
Message-ID: <20081017202313.GA13597@Krystal> (raw)
In-Reply-To: <57C9024A16AD2D4C97DC78E552063EA35334594F@orsmsx505.amr.corp.intel.com>

* Luck, Tony (tony.luck@intel.com) wrote:
> > Hrm, on such systems
> > - *large* amount of cpus
> > - no synchronized TSCs
> >
> > What would be the best approach to order events ?
> 
> There isn't a perfect solution for this.  My feeling is
> that your best hope is with per-cpu buffers logged with
> the local TSC ... together with some fancy heuristics to
> post-process the logs to come up with the best approximation
> to the actual ordering.
> 
> If you have a tight upper bound estimate for the
> errors in converting from "per-cpu" TSC values to "global
> system time" then the post processing tool will be able
> to identify events for which the order is uncertain.
>

The only problem I see with "fancy heuristics" regarding the time base
is that when we detect that something is going wrong in the kernel or in
a userspace program, the *very last* thing we want is to doubt the
reliability of the time source. When a problematic situation is
detected, it makes a huge difference whether this information can be
trusted or not. I've seen much simpler algorithms in the past (I'm
referring to the original LTT heartbeat here) which were claimed to be
plain simple but ended up being buggy and unreliable in rare
corner cases (they did not take interrupt latency into account). After
fixing the main problems, I decided to start over from scratch,
because unreliable timestamps mean unreliable traces, and that is not
something I am willing to provide.


> > Do you think we should consider using HPET, event though it's
> > painfully slow ? Would it be faster than cache-line bouncing
> > on such large boxes ? With a frequency around 10MHz, that
> > would give a 100ns precision, which should be enough
> > to order events.
> 
> This sounds like a poor choice.  Makes all traces very
> slow.  100ns precision isn't all that good ... we can
> probably do almost as well estimating the delta between
> TSC on different cpus.
> 
> -Tony
> 

100ns is not bad at all actually, especially given we don't plan to
require a memory barrier to be issued around the timestamp counter
reads. Memory reads/writes can easily be reordered in ways that cause
timing skew on the order of 100ns. Also, TSC frequency drift, and the
imprecision of TSC synchronization even when the TSCs are synchronized
(typically one cache-line transfer delay when they are not synchronized
by the BIOS/motherboard), are also on the order of 100ns. So sorry, I
disagree and think 100ns is actually the kind of precision we can expect
even from TSC reads.

Having read a lot about the subtle timestamp counter bugs one can find
in Intel and AMD boxes (a gross summary of my findings is here :
http://ltt.polymtl.ca/svn/trunk/lttv/doc/developer/tsc.txt), I think
there is no reliable way to give an upper bound on the timing
inaccuracy, even with heroic measures trying to map the specific bugs of
each of those CPUs, when you have things like southbridge temperature
throttling slowing down your CPU clock without notifying the kernel. And
as I said above, the timestamping code should be _very_ _very_ simple,
given that the first thing a kernel developer will point a finger at
when a tracer discovers a bug in his code is the tracer itself. So let's
save everyone precious time and make this code easy to review. :-)

So we are talking about the performance impact of time base reads. Let's
look at some interesting numbers :

On my x86_64 box
model name  : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping  : 6
cpu MHz   : 2000.060
(dual quad-core, *not* NUMA)

Cycles to read (averaged over 10000 loops, on a single CPU) :

get_cycles :                                                         60
cpuid + get_cycles + cpuid (with rdtsc_barrier) :                    77
ltt_get_timestamp_64() with synchronized TSC :                       79
ltt_get_timestamp_64() with non-synchronized TSC (cache-hot) :      163
ltt_get_timestamp_64() with non-synchronized TSC (after clflush) :  310
   (just doing the clflush takes 68 cycles; this has been subtracted
   from the previous result)
HPET :                                                              945

So if we have 512 processors doing timestamp reads like crazy, we can
expect execution to be serialized by cache-line transfer operations
from cpu to cpu. Therefore, assuming a worst-case scenario where every
timestamp read causes a cache-line transfer, the 310 cycles (upper
bound) it takes to do the cmpxchg, with CPUs running at 2.0GHz, means
that we can do 6 451 612 timestamp reads per second on the overall
system. Spread across 512 CPUs, that means we can do 12 600 timestamp
reads/second/cpu.
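For what it's worth, the arithmetic above is easy to double-check with a
back-of-the-envelope script; the only inputs are the cycle counts and
clock frequency measured earlier, nothing else is assumed :

```python
# Check of the serialized cache-line-bouncing throughput estimate.
CPU_HZ = 2_000_000_000   # 2.0 GHz, as on the test box above
CMPXCHG_CYCLES = 310     # worst case: non-synchronized TSC, after clflush
NR_CPUS = 512

# If every read is serialized by a cache-line transfer, the whole
# system completes one read per CMPXCHG_CYCLES cycles.
reads_per_sec_total = CPU_HZ // CMPXCHG_CYCLES
print(reads_per_sec_total)             # ~6.45M reads/s system-wide

# Divided evenly across the 512 CPUs.
print(reads_per_sec_total // NR_CPUS)  # 12600 reads/s/cpu
```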

Compared to this, HPET would offer a slower time base read (945 cycles
per read is fairly slow, which gives 2 116 402 timestamp reads per
second), but if this mmio read is done in parallel across CPUs (I
don't see any reason why it should hold the bus exclusively for a simple
read.. ?), then it would scale much better, so we could expect about 2.1M
timestamp reads/second/cpu.
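The key difference is that, under the assumption that mmio reads
proceed in parallel, the HPET rate is per-CPU rather than system-wide :

```python
CPU_HZ = 2_000_000_000   # 2.0 GHz
HPET_CYCLES = 945        # measured HPET read cost above

# Assuming no bus serialization, each CPU independently sustains
# one HPET read per HPET_CYCLES cycles.
print(CPU_HZ // HPET_CYCLES)  # ~2.1M reads/s *per cpu*
```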

I guess the cache-line bouncing approach would get much worse with NUMA
systems, and in that case HPET could become increasingly interesting.

To give an order of magnitude, I expect some worst-case scenarios to be
around 8MB/s/cpu when tracing stuff like lockdep into circular per-cpu
memory buffers. With an average of, say, 16 bytes per event, including
the event header and payload, that would mean 524 288 events/second/cpu.
In that case, using the cache-line bouncing approach on a 512-CPU box
would simply kill the system, but if HPET reads do not imply
serialization, time base reads would add a 25% performance loss in this
utterly extreme case, which is acceptable given this is best-effort.
Compared to this, the TSC-based solution (given we have synchronized
TSCs) would add a 2% performance hit.
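The 25% and 2% figures fall out directly from the numbers already given
(event rate times per-read cost as a fraction of the CPU clock) :

```python
CPU_HZ = 2_000_000_000            # 2.0 GHz
BYTES_PER_SEC = 8 * 1024 * 1024   # 8MB/s/cpu worst-case tracing rate
BYTES_PER_EVENT = 16              # average header + payload size

events_per_sec = BYTES_PER_SEC // BYTES_PER_EVENT
print(events_per_sec)             # 524288 events/s/cpu

# Fraction of each CPU's cycles spent reading the time base.
hpet_overhead = events_per_sec * 945 / CPU_HZ  # HPET read: 945 cycles
tsc_overhead = events_per_sec * 79 / CPU_HZ    # synchronized-TSC path: 79
print(round(hpet_overhead * 100))  # 25 (%)
print(round(tsc_overhead * 100))   # 2 (%)
```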

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

