public inbox for linux-trace-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	Mark Rutland <mark.rutland@arm.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
Date: Mon, 18 Dec 2023 17:52:29 -0500	[thread overview]
Message-ID: <20231218175229.58ec3daf@gandalf.local.home> (raw)
In-Reply-To: <20231218134240.4ed0ecbd@gandalf.local.home>

On Mon, 18 Dec 2023 13:42:40 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > >     
> > > >  static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> > > >  {
> > > > -	return rb_time_cmpxchg(t, expect, set);
> > > > +#ifdef RB_TIME_32
> > > > +	return expect == READ_ONCE(t->time);    
> > 
> > And I need to make a v2 as the above is wrong. It should have been:
> > 
> > 	return expect == local64_read(&t->time);  
> 
> 
> My v2 version will also make 64 bit not guaranteed to update on return of
> true. Which adds even more reason to separate out the two.

This code was failing my tests, and after figuring out why, I realized I
can remove this 64-bit cmpxchg for both 64-bit and 32-bit architectures!

I was thinking that the 64-bit cmpxchg() was to keep from adding an
absolute timestamp to after the slowpath. But that was not the case. It was
actually going to *add* a absolute timestamp if it succeeded. Well, not in
every case, but in some cases.

First let me explain the purpose of this last 64-bit cmpxchg:

Before reserving the data on the ring buffer, we do:


 /*A*/	w = local_read(&tail_page->write) & RB_WRITE_MASK;
	barrier();
	b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
	a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
	barrier();
	info->ts = rb_time_stamp(cpu_buffer->buffer);

The 'w' is the location we expect to have our data on.

Then we read the two timestamps "before_stamp" and "write_stamp" and save
them locally on the stack (info is a stack item).

Then we read the current timestamp and place it into info->ts.

 /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);

All events written into the ring buffer will write into the "before_stamp"
before reserving the ring buffer. I cut out the logic above that makes sure
that "before_stamp" matches "write_stamp", as if they do not match, the
"write_stamp" can not be trusted. As soon as the "before_stamp" is written
to, then any interrupting events will not trust the "write_stamp" and add
its own absolute timestamp, but also update the write_stamp to its
before_stamp so that later events can be trusted.

Now reserve the data for this event on the ring buffer:

 /*C*/	write = local_add_return(info->length, &tail_page->write);

	tail = write - info->length;

	if (likely(tail == w)) {

The above reserves the data for the event on the ring buffer, and then
checks to see if where it was reserved matches where it expected to be
reserved at the start of the operations above.

		[...]

	} else {

Now we go into the slow path of where some thing snucked in between /*A*/
and /*C*/

		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);

Read the write_stamp again.

		ts = rb_time_stamp(cpu_buffer->buffer);

Get the current timestamp

		barrier();
 /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
		    info->after < ts &&

If "write" from /*C*/ still matches the current location on the ring
buffer, we know that nothing came in between /*C*/ and /*E*/


		    rb_time_cmpxchg(&cpu_buffer->write_stamp,
				    info->after, ts)) {

			info->delta = ts - info->after;

Now the above cmpxchg() is needed in this code because we need to update
the write_stamp to the current timestamp so that we can safely calculate the
delta.

But by writing to the write_stamp via the cmpxchg() we are actually making
it different than the before_stamp, as the interruption could have been
between /*B*/ and /*C*/, and this would also force the next event to use an
absolute timestamp. Of course, if it happened between /*A*/ and /*B*/ it is
actually validating the write_stamp again by making it match the
before_stamp.

The real reason we need to do the cmpxchg() is because of a possible
interruption after /*E*/ above. The write_stamp needs to be updated
atomically, otherwise the interrupting event that comes after /*E*/, will
be using the previous interruption write_stamp to calculate its delta from.

For example:

 /*B*/	rb_time_set(&cpu_buffer->before_stamp, info->ts);

  ---> interrupt, updates before_stamp and write_stamp to its own ts

 /*C*/	write = local_add_return(info->length, &tail_page->write);

/* Allocated memory, any interruptions here, will be using the previous
   interrupt ts and not this event */

 /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
		    info->after < ts &&

If write == tail_page->write than nothing came in, so it is still safe to
add a delta, but only if we atomically update the write_stamp, forcing
interrupting events to add an absolute timestamp. If it is not done
atomically, then there's a race that another interrupt could come in, and
use the previous interrupt event to calculate its delta, even though, this
event is in between the two.

BUT! we can also do this instead without the cmpcxchg! Especially as
success of cmpxchg could possibly force a absolute timestamp. If we always
force the absolute timestamp, then there's no need for the cmpxchg at all.

By doing:

		/*
		 * Read a new timestamp and update before_stamp with it.
		 * Any new event coming in now will use an absolute timestamp
		 */
		ts = rb_time_stamp(cpu_buffer->buffer);
		rb_time_set(&cpu_buffer->before_stamp, ts);

		/* Next event will force an absolute timestamp */

		barrier();
		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
		barrier();

 /*E*/		if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
		    info->after < ts) {

			/*
			 * If nothing came in between C and E, it is safe
			 * to use the write_stamp as the delta.
			 */
			info->delta = ts - info->after;
		} else {
			/*
			 * Interrupted twice and the second interruption is possibly
			 * using the first interruption to calculate its delta. Just
			 * set our delta to zero to not mess the event that came in
			 * after up.
			 */
			info->delta = 0;
		}


With this logic, we do not need any 64-bit cmpxchg() at all!

-- Steve

  reply	other threads:[~2023-12-18 22:51 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-15 16:55 [PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available Steven Rostedt
2023-12-15 16:55 ` [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update() Steven Rostedt
2023-12-18 14:24   ` Masami Hiramatsu
2023-12-18 15:15     ` Steven Rostedt
2023-12-18 18:42       ` Steven Rostedt
2023-12-18 22:52         ` Steven Rostedt [this message]
2023-12-15 16:55 ` [PATCH 2/2] ring-buffer: Remove 32bit timestamp logic Steven Rostedt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231218175229.58ec3daf@gandalf.local.home \
    --to=rostedt@goodmis.org \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox