From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Hellwig <hch@infradead.org>,
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>,
Gregory Haskins <ghaskins@novell.com>,
Arnaldo Carvalho de Melo <acme@ghostprotocols.net>,
Thomas Gleixner <tglx@linutronix.de>,
Tim Bird <tim.bird@am.sony.com>, Sam Ravnborg <sam@ravnborg.org>,
"Frank Ch. Eigler" <fche@redhat.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
John Stultz <johnstul@us.ibm.com>,
Arjan van de Ven <arjan@infradead.org>,
Steven Rostedt <srostedt@redhat.com>
Subject: Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation
Date: Mon, 4 Feb 2008 13:40:23 -0800 [thread overview]
Message-ID: <20080204214023.GC29646@linux.vnet.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.58.0802041153420.4119@gandalf.stny.rr.com>
On Mon, Feb 04, 2008 at 12:09:00PM -0500, Steven Rostedt wrote:
>
> Hi Paul,
>
> First I want to say, "Thank you", for taking the time to explain this in
> considerable detail. But I still have some minor questions.
>
> (Even though you already convinced me, but I still want full
> understanding ;-)
OK, will see what I can do...
> On Sat, 2 Feb 2008, Paul E. McKenney wrote:
>
> > Yep, you have dependencies, so something like the following:
> >
> > initial state:
> >
> > struct foo {
> > int a;
> > };
> > struct foo x = { 0 };
> > struct foo y = { 0 };
> > struct foo *global_p = &y;
> > /* other variables are appropriately declared auto variables */
> >
> > /* No kmalloc() or kfree(), hence no RCU grace periods. */
> > /* In the terminology of http://lwn.net/Articles/262464/, we */
> > /* are doing only publish-subscribe, nothing else. */
> >
> > writer:
> >
> > x.a = 1;
> > smp_wmb(); /* or smp_mb() */
> > global_p = &x;
> >
> > reader:
> >
> > p = global_p;
> > ta = p->a;
> >
> > Both Alpha and aggressive compiler optimizations can result in the reader
> > seeing the new value of the pointer (&x) but the old value of the field
> > (0). Strange but true. The fix is as follows:
> >
> > reader:
> >
> > p = global_p;
> > smp_read_barrier_depends(); /* or use rcu_dereference() */
> > ta = p->a;
> >
> > So how can this happen? First note that if smp_read_barrier_depends()
> > was unnecessary in this case, it would be unnecessary in all cases.
> >
> > Second, let's start with the compiler. Suppose that a highly optimizing
> > compiler notices that in almost all cases, the reader finds p==global_p.
> > Suppose that this compiler also notices that one of the registers (say
> > r1) almost always contains this expected value of global_p, and that
> > cache pressure ensures that an actual load from global_p almost always
> > generates an expensive cache miss. Such a compiler would be within its
> > rights (as defined by the C standard) to generate code assuming that r1
> > already had the right value, while also generating code to validate this
> > assumption, perhaps as follows:
> >
> > r2 = global_p; /* high latency, other things complete meanwhile */
> > ta == r1->a;
> > if (r1 != r2)
> > ta = r2->a;
> >
> > Now consider the following sequence of events on a superscalar CPU:
>
> I think you missed one step here (causing my confusion). I don't want to
> assume so I'll try to put in the missing step:
>
> writer: r1 = p; /* happens to use r1 to store parameter p */
You lost me on this one... The writer has only the following three steps:
writer:
x.a = 1;
smp_wmb(); /* or smp_mb() */
global_p = &x;
Where did the "r1 = p" come from? For that matter, where did "p" come
from?
> > reader: r2 = global_p; /* issued, has not yet completed. */
> > reader: ta = r1->a; /* which gives zero. */
> > writer: x.a = 1;
> > writer: smp_wmb();
> > writer: global_p = &x;
> > reader: r2 = global_p; /* this instruction now completes */
> > reader: if (r1 != r2) /* and these are equal, so we keep bad ta! */
>
> Is that the case?
Ah! Please note that I am doing something unusual here in that I am
working with global variables, as opposed to the normal RCU practice of
dynamically allocating memory. So "x" is just a global struct, not a
pointer to a struct.
> > I have great sympathy with the argument that this level of optimization
> > is simply insane, but the fact is that there are real-world compilers
> > that actually do this sort of thing. In addition, there are cases where
> > the compiler might be able to figure out that a value is constant, thus
> > breaking the dependency chain. This is most common for array references
> > where the compiler might be able to figure out that a given array index
> > is always zero, thus optimizing away the load and the dependency that
> > the programmer might expect to enforce ordering. (I have an example
> > of this down at the end.)
> >
> > This sort of misordering is also done by DEC Alpha hardware, assuming
> > split caches. This can happen if the variable x is in an odd-numbered
> > cache line and the variable global_p is in an even-numbered cache line.
> > In this case, the smp_wmb() affects the memory order, but only within
> > the writing CPU. The ordering can be defeated in the reading CPU as
> > follows:
> >
> > writer: x.a = 1;
> > writer: smp_wmb();
> > writer: global_p = &x;
> > reader: p = global_p;
> > reader: ta = p->a;
> >
> > But the reader's odd-numbered cache shard is loaded
> > down with many queued cacheline invalidation requests,
> > so the old cached version of x.a==0 remains in the
> > reader's cache, so that the reader sees ta==0.
> >
> > In contrast:
> >
> > writer: x.a = 1;
> > writer: smp_wmb();
> > writer: global_p = &x;
> > reader: p = global_p;
> > reader: smp_read_barrier_depends();
> >
> > The above barrier forces all cacheline invalidation
> > requests that have arrived at the reading CPU to be
> > processed before any subsequent reads, including
> > the pending invalidation request for the variable x.
> >
> > reader: ta = p->a;
> >
> > So ta is now guaranteed to be 1, as desired.
>
> Thanks, this is starting to clear things up for me (And scare me away from
> Alpha's)
Of course, fairness would require that it also scare you away from
value-speculation optimizations in compilers. ;-)
Thanx, Paul
> > > > > > Let me explain the situation here.
> > > > > >
> > > > > > We have a single link list called mcount_list that is walked when more
> > > > > > than one function is registered by mcount. Mcount is called at the start
> > > > > > of all C functions that are not annotated with "notrace". When more than
> > > > > > one function is registered, mcount calls a loop function that does the
> > > > > > following:
> > > > > >
> > > > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > > > {
> > > > > > struct mcount_ops *op = mcount_list;
> > > > >
> > > > > When thinking RCU, this would be rcu_dereference and imply a read
> > > > > barrier.
> > > > >
> > > > > > while (op != &mcount_list_end) {
> > > > > > op->func(ip, parent_ip);
> > > > > > op = op->next;
> > > > >
> > > > > Same here; the rcu_dereference() would do the read depend barrier.
> > > >
> > > > Specifically:
> > > >
> > > > notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
> > > > {
> > > > struct mcount_ops *op = rcu_dereference(mcount_list);
> > > >
> > > > while (op != &mcount_list_end) {
> > > > op->func(ip, parent_ip);
> > > > op = rcu_dereference(op->next);
> > > >
> > > > This assumes that you are using call_rcu(), synchronize_rcu(), or
> > > > whatever to defer freeing/reuse of the ops structure.
> > >
> > > One special part of this is that the ops structure is never to be freed
> > > (this is documented). It should be a static read-mostly structure.
> > > Since it is not to be freed, I did not export the registered functions to
> > > keep modules from using it. I may later add an export that will cause the
> > > module to increment it's usage count so that it may never be freed.
> > >
> > > There's no guarantees that prevent the func from being called after it was
> > > unregistered, nor should the users of this, ever touch the "next" pointer.
> > >
> > > This makes things easy when you don't need to free ;-)
> >
> > It can indeed make things easier, but it does not help in this case.
> > This memory-ordering problem appears even if you never free anything, as
> > described above. Again, in the terminology laid out in the LWN article
> > at http://lwn.net/Articles/262464/, you are doing a publish-subscribe
> > operation, and it still must be protected.
> >
> > But yes, my comment above about using call_rcu() and friends did in fact
> > incorrectly assume that you were freeing (or otherwise re-using) the
> > data structures.
> >
> > > > > > };
> > > > > > }
> > > > > >
> > > > > > A registered function must already have a "func" filled, and the mcount
> > > > > > register code takes care of "next". It is documented that the calling
> > > > > > function should "never" change next and always expect that the func can be
> > > > > > called after it is unregistered. That's not the issue here.
> > > > > >
> > > > > > The issue is how to insert the ops into the list. I've done the following,
> > > > > > as you can see in the code this text is inserted between.
> > > > > >
> > > > > > ops->next = mcount_list;
> > > > > > smp_wmb();
> > > > > > mcount_list = ops;
> > > > > >
> > > > > > The read side pair is the reading of ops to ops->next, which should imply
> > > > > > a smp_rmb() just by the logic. But Peter tells me things like alpha is
> > > > > > crazy enough to do better than that! Thus, I'm asking you.
> > > >
> > > > Peter is correct when he says that Alpha does not necessarily respect data
> > > > dependencies. See the following URL for the official story:
> > > >
> > > > http://www.openvms.compaq.com/wizard/wiz_2637.html
> > > >
> > > > And I give an example hardware cache design that can result in this
> > > > situation here:
> > > >
> > > > http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
> > > >
> > > > See the discussion starting with the "Why Reorder Memory Accesses?"
> > > > heading in the second column of the first page.
> > > >
> > > > Strange, but true. It took an Alpha architect quite some time to
> > > > convince me of this back in the late 90s. ;-)
> > > >
> > > > > > Can some arch have a reader where it receives ops->next before it received
> > > > > > ops? This seems to me to be a phsyic arch, to know where ops->next is
> > > > > > before it knows ops!
> > > >
> > > > The trick is that the machine might have a split cache, with (say)
> > > > odd-numbered cache lines being processed by one half and even-numbered
> > > > lines processed by the other half. If reading CPU has one half of the
> > > > cache extremely busy (e.g., processing invalidation requests from other
> > > > CPUs) and the other half idle, memory misordering can happen in the
> > > > receiving CPU -- if the pointer is processed by the idle half, and
> > > > the pointed-to struct by the busy half, you might see the unitialized
> > > > contents of the pointed-to structure. The reading CPU must execute
> > > > a memory barrier to force ordering in this case.
> > > >
> > > > > > Remember, that the ops that is being registered, is not viewable by any
> > > > > > other CPU until mcount_list = ops. I don't see the need for a read barrier
> > > > > > in this case. But I could very well be wrong.
> > > >
> > > > And I was right there with you before my extended discussions with the
> > > > aforementioned Alpha architect!
> > >
> > > hmm, I'm still not convinced ;-)
> > >
> > > This is a unique situation. We don't need to worry about items being freed
> > > because there's too many races to allow that. The items are only to
> > > register functions and are not to be dynamically allocated or freed. In
> > > this situation we do not need to worry about deletions.
> > >
> > > The smp_wmb is only for initialization of something that is about to enter
> > > the list. It is not to protect against freeing.
> >
> > Similarly, the smp_read_barrier_depends() is only for initialization
> > of something that is about to enter the list. As with the smp_wmb()
> > primitive, smp_read_barrier_depends() also is not to protect against
> > freeing. Instead, it is rcu_read_lock() and rcu_read_unlock() that
> > protect against freeing.
> >
> > > Specifically:
> > >
> > > ops->next = mcount_list;
> > > smp_wmb();
> > > mcount_list = ops;
> > >
> > > What this is to prevent is a new item that has next = NULL being viewable
> > > to other CPUS before next is initalized.
> >
> > Were it not for aggressive compiler optimizations and DEC Alpha, you would
> > be correct. What this instead does is to do the writer's part of the job
> > of preventing such new items from being visible to other CPUs before ->next
> > is initialized. These other CPUs must do their part as well, and that
> > part is smp_read_barrier_depends() -- or rcu_dereference(), whichever is
> > most appropriate.
>
> Since the code doesn't use RCU, I'll keep with the
> smp_read_barrier_depends().
>
> >
> > > On another cpu we have (simplified by removing loop):
> > >
> > > op = mcount_list;
> > > op->func();
> > > op = op->next;
> > > if (op->next != NULL)
> > > op->func;
> > >
> > > What we want to prevent is reading of the new ops before ops->next is set.
> >
> > Understood.
> >
> > > What you are saying is that on alpha, even though the write to ops->next
> > > has completed before mcount_list is set, we can still get a reversed
> > > order?
> >
> > That is exactly what I am saying. In addition, I am saying that
> > aggressive compiler optimizations can have this same effect, even on
> > non-Alpha CPUs.
> >
> > > ops->next = mcount_list; -- in one cache line
> > > smp_wmb();
> > > mcount_list = ops; -- in another cache line
> > >
> > > Even though the ops->next is completed, we can have on another cpu:
> > >
> > > op = mcount_list; (which is the ops from above)
> > > op = op->next; -- still see the old ops->next?
> >
> > Yes, this bizarre sequence of events really can happen. The fix is to
> > do the following:
> >
> > op = mcount_list; (which is the ops from above)
> > smp_read_barrier_depends();
> > op = op->next; -- no longer see the old ops->next
> >
> > > I just want to understand this. I already put in the read_barrier_depends
> > > because it doesn't hurt on most archs anyway (nops).
> >
> > Very good!!!
> >
> > And here is the example using array indexes.
> >
> > initial state:
> >
> > struct foo {
> > int a;
> > };
> > struct foo x[ARRAY_SIZE] = { 0 };
> > struct foo *global_p = &x[0];
> > /* other variables are appropriately declared auto variables */
> >
> > /* No kmalloc() or kfree(), hence no RCU grace periods. */
> > /* In the terminology of http://lwn.net/Articles/262464/, we */
> > /* are doing only publish-subscribe, nothing else. */
> >
> > writer:
> >
> > x[cur_idx].a = 1;
> > smp_wmb(); /* or smp_mb() */
> > global_idx = cur_idx;
> >
> > reader:
> >
> > i = global_idx;
> > ta = x[i].a
> >
> > Suppose we have ARRAY_SIZE of 1. Then the standard states that the
> > results of indexing x[] with a non-zero index are undefined. Since they
> > are undefined, the compiler is within its rights to assume that the
> > index will always be zero, so that the reader code would be as follows:
> >
> > reader:
> >
> > ta = x[0].a
> >
> > No dependency, no ordering. So this totally reasonable generated code
> > could see the pre-initialized value of field a. The job of both
> > smp_read_barrier_depends() and rcu_dereference() is to tell both the
> > CPU and the compiler that such assumptions are ill-advised.
> >
>
> Paul,
>
> Thanks again for this lengthy email. It took me several readings to absorb
> it all in.
>
> I recommend that someone have a pointer to this email because it really
> does explain why read_barrier_depends is needed.
>
> Excellent job of explaining this!!! Much appreciated.
Glad you liked it!!!
Thanx, Paul
next prev parent reply other threads:[~2008-02-04 21:40 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-30 3:15 [PATCH 00/22 -v7] mcount and latency tracing utility -v7 Steven Rostedt
2008-01-30 3:15 ` [PATCH 01/22 -v7] printk - dont wakeup klogd with interrupts disabled Steven Rostedt
2008-01-30 3:15 ` [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation Steven Rostedt
2008-01-30 8:46 ` Peter Zijlstra
2008-01-30 13:08 ` Steven Rostedt
2008-01-30 14:09 ` Steven Rostedt
2008-01-30 14:25 ` Peter Zijlstra
2008-02-01 22:34 ` Paul E. McKenney
2008-02-02 1:56 ` Steven Rostedt
2008-02-02 21:41 ` Paul E. McKenney
2008-02-04 17:09 ` Steven Rostedt
2008-02-04 21:40 ` Paul E. McKenney [this message]
2008-02-04 22:03 ` Steven Rostedt
2008-02-04 22:41 ` Mathieu Desnoyers
2008-02-05 6:11 ` Paul E. McKenney
2008-02-05 5:13 ` Paul E. McKenney
2008-01-30 13:21 ` Jan Kiszka
2008-01-30 13:53 ` Steven Rostedt
2008-01-30 14:28 ` Steven Rostedt
2008-01-30 3:15 ` [PATCH 03/22 -v7] Annotate core code that should not be traced Steven Rostedt
2008-01-30 3:15 ` [PATCH 04/22 -v7] x86_64: notrace annotations Steven Rostedt
2008-01-30 3:15 ` [PATCH 05/22 -v7] add notrace annotations to vsyscall Steven Rostedt
2008-01-30 8:49 ` Peter Zijlstra
2008-01-30 13:15 ` Steven Rostedt
2008-01-30 3:15 ` [PATCH 06/22 -v7] handle accurate time keeping over long delays Steven Rostedt
2008-01-30 3:15 ` [PATCH 07/22 -v7] initialize the clock source to jiffies clock Steven Rostedt
2008-01-30 3:15 ` [PATCH 08/22 -v7] add get_monotonic_cycles Steven Rostedt
2008-01-30 3:15 ` [PATCH 09/22 -v7] add notrace annotations to timing events Steven Rostedt
2008-01-30 3:15 ` [PATCH 10/22 -v7] mcount based trace in the form of a header file library Steven Rostedt
2008-01-30 3:15 ` [PATCH 11/22 -v7] Add context switch marker to sched.c Steven Rostedt
2008-01-30 3:15 ` [PATCH 12/22 -v7] Make the task State char-string visible to all Steven Rostedt
2008-01-30 3:15 ` [PATCH 13/22 -v7] Add tracing of context switches Steven Rostedt
2008-01-30 3:15 ` [PATCH 14/22 -v7] Generic command line storage Steven Rostedt
2008-01-30 3:15 ` [PATCH 15/22 -v7] trace generic call to schedule switch Steven Rostedt
2008-01-30 3:15 ` [PATCH 16/22 -v7] Add marker in try_to_wake_up Steven Rostedt
2008-01-30 3:15 ` [PATCH 17/22 -v7] mcount tracer for wakeup latency timings Steven Rostedt
2008-01-30 9:31 ` Peter Zijlstra
2008-01-30 13:18 ` Steven Rostedt
2008-01-30 3:15 ` [PATCH 18/22 -v7] Trace irq disabled critical timings Steven Rostedt
2008-01-30 3:15 ` [PATCH 19/22 -v7] trace preempt off " Steven Rostedt
2008-01-30 9:40 ` Peter Zijlstra
2008-01-30 13:40 ` Steven Rostedt
2008-01-30 3:15 ` [PATCH 20/22 -v7] Add markers to various events Steven Rostedt
2008-01-30 3:15 ` [PATCH 21/22 -v7] Add event tracer Steven Rostedt
2008-01-30 3:15 ` [PATCH 22/22 -v7] Critical latency timings histogram Steven Rostedt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080204214023.GC29646@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=acme@ghostprotocols.net \
--cc=akpm@linux-foundation.org \
--cc=arjan@infradead.org \
--cc=fche@redhat.com \
--cc=ghaskins@novell.com \
--cc=hch@infradead.org \
--cc=jan.kiszka@siemens.com \
--cc=johnstul@us.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@polymtl.ca \
--cc=mingo@elte.hu \
--cc=rostedt@goodmis.org \
--cc=sam@ravnborg.org \
--cc=srostedt@redhat.com \
--cc=tglx@linutronix.de \
--cc=tim.bird@am.sony.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.