* [PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available
@ 2023-12-15 16:55 Steven Rostedt
2023-12-15 16:55 ` [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update() Steven Rostedt
2023-12-15 16:55 ` [PATCH 2/2] ring-buffer: Remove 32bit timestamp logic Steven Rostedt
0 siblings, 2 replies; 7+ messages in thread
From: Steven Rostedt @ 2023-12-15 16:55 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds
With the introduction of a389d86f7fd09 ("ring-buffer: Have nested events
still record running time stamp"), the timestamps required needing 64-bit
cmpxchg. As some architectures do no even have a 64-bit cmpxchg, the code
for 32-bit architectures used 3 32-bit words that represented the 64-bit
timestamp and this also required performing 3 32-bit cmpxchg where a single
64-bit would do.
While debugging the ring-buffer, it was found that the locations of 3 of the
4 cmpxchg() were actually causing bugs, and the code was restructured to
remove them! See:
https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
The only remaining cmpcxhg() is in a slow path that gets hit if an interrupt
were to come in during the allocation of an event and itself would right an
event to the same buffer. The interrupted event would detect this, and use
the cmpxchg for two purposes. One was to know if the interrupt happened
before it allocated its space (where it can use the timestamp that was
found), and the other is to set the write_stamp back to matching the
before_stamp, where the event after the interrupted event would not need to
add an absolute timestamp (it's 8 bytes added to the ring buffer).
The first purpose of the cmpxchg could also be done with a simple compare.
The second purpose (avoiding the addition of the absolute timestamp)
requires the cmpxchg. Considering the complexity of the 32-bit version of
the 64-bit cmpxchg, avoiding an added absolute timestamp doesn't justify it.
The first patch replaces the last 64-bit cmpxchg() with a
rb_time_cmp_and_update() that will return true if the timestamp matches the
expected result. It will optionally update the timestamp with the "set"
parameter if cmpxchg is available.
The second patch removes the 32-bit version of the 64-bit cmpxchg and simply
does the compare. This also removes the complex logic of that code. The
difference now is that 32-bit architectures will have to add an absolute
timestamp to an event that follows an event that suffered the race
condition.
Steven Rostedt (Google) (2):
ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
ring-buffer: Remove 32bit timestamp logic
----
kernel/trace/ring_buffer.c | 247 +++++++--------------------------------------
1 file changed, 36 insertions(+), 211 deletions(-)
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
2023-12-15 16:55 [PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available Steven Rostedt
@ 2023-12-15 16:55 ` Steven Rostedt
2023-12-18 14:24 ` Masami Hiramatsu
2023-12-15 16:55 ` [PATCH 2/2] ring-buffer: Remove 32bit timestamp logic Steven Rostedt
1 sibling, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2023-12-15 16:55 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
There's only one place that performs a 64-bit cmpxchg for the timestamp
processing. The cmpxchg is only to set the write_stamp equal to the
before_stamp, and if it doesn't get set, then the next event will simply
be forced to add an absolute timestamp.
Given that 64-bit cmpxchg is expensive on 32-bit, and the current
workaround uses 3 consecutive 32-bit cmpxchg doesn't make it any faster.
It's best to just not do the cmpxchg as a simple compare works for the
accuracy of the timestamp. The only thing that will happen without the
cmpxchg is the prepended absolute timestamp on the next event which is not
that big of a deal as the path where this happens is seldom hit because it
requires an interrupt to happen between a few lines of code that also
writes an event into the same buffer.
With this change, the 32-bit rb_time_t workaround can be removed.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1a26b3a1901f..c6842a4331a9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -762,6 +762,15 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
}
#endif
+/*
+ * Returns true if t == expect, and if possible will update with set,
+ * but t is not guaranteed to be updated even if this returns true
+ */
+static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
+{
+ return rb_time_cmpxchg(t, expect, set);
+}
+
/*
* Enable this to make sure that the event passed to
* ring_buffer_event_time_stamp() is not committed and also
@@ -3622,9 +3631,17 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
barrier();
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
- rb_time_cmpxchg(&cpu_buffer->write_stamp,
- info->after, ts)) {
- /* Nothing came after this event between C and E */
+ rb_time_cmp_and_update(&cpu_buffer->write_stamp,
+ info->after, ts)) {
+ /*
+ * Nothing came after this event between C and E it is
+ * safe to use info->after for the delta.
+ * The above rb_time_cmp_and_update() may or may not
+ * have updated the write_stamp. If it did not then
+ * the next event will simply add an absolute timestamp
+ * as the write_stamp will be different than the
+ * before_stamp.
+ */
info->delta = ts - info->after;
} else {
/*
--
2.42.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/2] ring-buffer: Remove 32bit timestamp logic
2023-12-15 16:55 [PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available Steven Rostedt
2023-12-15 16:55 ` [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update() Steven Rostedt
@ 2023-12-15 16:55 ` Steven Rostedt
1 sibling, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2023-12-15 16:55 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.
Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7fd09 changed that to handle the races giving
all events, even those that preempt other events, still have an accurate
timestamp.
To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.
Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.
After some more refactoring of the code, 3 of the 4 64-bit cmpxchg were
removed:
https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
The last remaining 64-bit cmpxchg is only in a slow path and to keep the
next event from having to add an absolute timestamp. This is not worth the
trouble of a slow cmpxchg. Only do the cmpxchg on 64-bit architectures and
just have the 32-bit architectures insert an absolute timestamp for the
next event that comes after the slow path. The slow path only happens if
the event being recorded is interrupted between a few lines of code and
that interrupt writes an event into the same buffer. This seldom happens
and should not cause any issues by inserting an extra absolute timestamp
when it does happen.
Now the complex 32-bit workaround for 64-bit cmpxchg can be removed.
Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 230 +++----------------------------------
1 file changed, 19 insertions(+), 211 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c6842a4331a9..47f9eda99769 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -463,27 +463,9 @@ enum {
RB_CTX_MAX
};
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
- local_t cnt;
- local_t top;
- local_t bottom;
- local_t msb;
-};
-#else
-#include <asm/local64.h>
struct rb_time_struct {
local64_t time;
};
-#endif
typedef struct rb_time_struct rb_time_t;
#define MAX_NEST 5
@@ -573,202 +555,35 @@ struct ring_buffer_iter {
int missed_events;
};
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- * - Reads may fail if it interrupted a modification of the time stamp.
- * It will succeed if it did not interrupt another write even if
- * the read itself is interrupted by a write.
- * It returns whether it was successful or not.
- *
- * - Writes always succeed and will overwrite other writes and writes
- * that were done by events interrupting the current write.
- *
- * - A write followed by a read of the same time stamp will always succeed,
- * but may not contain the same value.
- *
- * - A cmpxchg will fail if it interrupted another write or cmpxchg.
- * Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- * (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- * top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT 30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT 60
-
-static inline int rb_time_cnt(unsigned long val)
-{
- return (val >> RB_TIME_SHIFT) & 3;
-}
-
-static inline u64 rb_time_val(unsigned long top, unsigned long bottom)
-{
- u64 val;
-
- val = top & RB_TIME_VAL_MASK;
- val <<= RB_TIME_SHIFT;
- val |= bottom & RB_TIME_VAL_MASK;
-
- return val;
-}
-
-static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
-{
- unsigned long top, bottom, msb;
- unsigned long c;
-
- /*
- * If the read is interrupted by a write, then the cnt will
- * be different. Loop until both top and bottom have been read
- * without interruption.
- */
- do {
- c = local_read(&t->cnt);
- top = local_read(&t->top);
- bottom = local_read(&t->bottom);
- msb = local_read(&t->msb);
- } while (c != local_read(&t->cnt));
-
- *cnt = rb_time_cnt(top);
-
- /* If top, msb or bottom counts don't match, this interrupted a write */
- if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
- return false;
-
- /* The shift to msb will lose its cnt bits */
- *ret = rb_time_val(top, bottom) | ((u64)msb << RB_TIME_MSB_SHIFT);
- return true;
-}
-
-static bool rb_time_read(rb_time_t *t, u64 *ret)
-{
- unsigned long cnt;
-
- return __rb_time_read(t, ret, &cnt);
-}
-
-static inline unsigned long rb_time_val_cnt(unsigned long val, unsigned long cnt)
-{
- return (val & RB_TIME_VAL_MASK) | ((cnt & 3) << RB_TIME_SHIFT);
-}
-
-static inline void rb_time_split(u64 val, unsigned long *top, unsigned long *bottom,
- unsigned long *msb)
-{
- *top = (unsigned long)((val >> RB_TIME_SHIFT) & RB_TIME_VAL_MASK);
- *bottom = (unsigned long)(val & RB_TIME_VAL_MASK);
- *msb = (unsigned long)(val >> RB_TIME_MSB_SHIFT);
-}
-
-static inline void rb_time_val_set(local_t *t, unsigned long val, unsigned long cnt)
-{
- val = rb_time_val_cnt(val, cnt);
- local_set(t, val);
-}
-
-static void rb_time_set(rb_time_t *t, u64 val)
-{
- unsigned long cnt, top, bottom, msb;
-
- rb_time_split(val, &top, &bottom, &msb);
-
- /* Writes always succeed with a valid number even if it gets interrupted. */
- do {
- cnt = local_inc_return(&t->cnt);
- rb_time_val_set(&t->top, top, cnt);
- rb_time_val_set(&t->bottom, bottom, cnt);
- rb_time_val_set(&t->msb, val >> RB_TIME_MSB_SHIFT, cnt);
- } while (cnt != local_read(&t->cnt));
-}
-
-static inline bool
-rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
-{
- return local_try_cmpxchg(l, &expect, set);
-}
-
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
- unsigned long cnt, top, bottom, msb;
- unsigned long cnt2, top2, bottom2, msb2;
- u64 val;
-
- /* Any interruptions in this function should cause a failure */
- cnt = local_read(&t->cnt);
-
- /* The cmpxchg always fails if it interrupted an update */
- if (!__rb_time_read(t, &val, &cnt2))
- return false;
-
- if (val != expect)
- return false;
-
- if ((cnt & 3) != cnt2)
- return false;
-
- cnt2 = cnt + 1;
-
- rb_time_split(val, &top, &bottom, &msb);
- msb = rb_time_val_cnt(msb, cnt);
- top = rb_time_val_cnt(top, cnt);
- bottom = rb_time_val_cnt(bottom, cnt);
-
- rb_time_split(set, &top2, &bottom2, &msb2);
- msb2 = rb_time_val_cnt(msb2, cnt);
- top2 = rb_time_val_cnt(top2, cnt2);
- bottom2 = rb_time_val_cnt(bottom2, cnt2);
-
- if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
- return false;
- if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
- return false;
- if (!rb_time_read_cmpxchg(&t->top, top, top2))
- return false;
- if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
- return false;
- return true;
-}
-
-#else /* 64 bits */
-
-/* local64_t always succeeds */
-
-static inline bool rb_time_read(rb_time_t *t, u64 *ret)
+static inline void rb_time_read(rb_time_t *t, u64 *ret)
{
*ret = local64_read(&t->time);
- return true;
}
static void rb_time_set(rb_time_t *t, u64 val)
{
local64_set(&t->time, val);
}
-static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
-{
- return local64_try_cmpxchg(&t->time, &expect, set);
-}
+#if BITS_PER_LONG == 32
+#define RB_TIME_32
+#else
+#include <asm/local64.h>
#endif
+/* To test on 64 bit machines */
+//#define RB_TIME_32
+
/*
* Returns true if t == expect, and if possible will update with set,
* but t is not guaranteed to be updated even if this returns true
*/
static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
{
- return rb_time_cmpxchg(t, expect, set);
+#ifdef RB_TIME_32
+ return expect == READ_ONCE(t->time);
+#else
+ return local64_try_cmpxchg(&t->time, &expect, set);
+#endif
}
/*
@@ -876,10 +691,7 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
WARN_ONCE(1, "nest (%d) greater than max", nest);
fail:
- /* Can only fail on 32 bit */
- if (!rb_time_read(&cpu_buffer->write_stamp, &ts))
- /* Screw it, just read the current time */
- ts = rb_time_stamp(cpu_buffer->buffer);
+ rb_time_read(&cpu_buffer->write_stamp, &ts);
return ts;
}
@@ -2876,7 +2688,7 @@ rb_check_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
(unsigned long long)info->ts,
(unsigned long long)info->before,
(unsigned long long)info->after,
- (unsigned long long)(rb_time_read(&cpu_buffer->write_stamp, &write_stamp) ? write_stamp : 0),
+ (unsigned long long)({rb_time_read(&cpu_buffer->write_stamp, &write_stamp); write_stamp;}),
sched_clock_stable() ? "" :
"If you just came from a suspend/resume,\n"
"please switch to the trace global clock:\n"
@@ -3553,16 +3365,14 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
struct ring_buffer_event *event;
struct buffer_page *tail_page;
unsigned long tail, write, w;
- bool a_ok;
- bool b_ok;
/* Don't let the compiler play games with cpu_buffer->tail_page */
tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
/*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK;
barrier();
- b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
- a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
+ rb_time_read(&cpu_buffer->before_stamp, &info->before);
+ rb_time_read(&cpu_buffer->write_stamp, &info->after);
barrier();
info->ts = rb_time_stamp(cpu_buffer->buffer);
@@ -3577,7 +3387,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
if (!w) {
/* Use the sub-buffer timestamp */
info->delta = 0;
- } else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
+ } else if (unlikely(info->before != info->after)) {
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
} else {
@@ -3624,9 +3434,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
} else {
u64 ts;
/* SLOW PATH - Interrupted between A and C */
- a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
- /* Was interrupted before here, write_stamp must be valid */
- RB_WARN_ON(cpu_buffer, !a_ok);
+ rb_time_read(&cpu_buffer->write_stamp, &info->after);
ts = rb_time_stamp(cpu_buffer->buffer);
barrier();
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
--
2.42.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
2023-12-15 16:55 ` [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update() Steven Rostedt
@ 2023-12-18 14:24 ` Masami Hiramatsu
2023-12-18 15:15 ` Steven Rostedt
0 siblings, 1 reply; 7+ messages in thread
From: Masami Hiramatsu @ 2023-12-18 14:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds
On Fri, 15 Dec 2023 11:55:13 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> There's only one place that performs a 64-bit cmpxchg for the timestamp
> processing. The cmpxchg is only to set the write_stamp equal to the
> before_stamp, and if it doesn't get set, then the next event will simply
> be forced to add an absolute timestamp.
>
> Given that 64-bit cmpxchg is expensive on 32-bit, and the current
> workaround uses 3 consecutive 32-bit cmpxchg doesn't make it any faster.
> It's best to just not do the cmpxchg as a simple compare works for the
> accuracy of the timestamp. The only thing that will happen without the
> cmpxchg is the prepended absolute timestamp on the next event which is not
> that big of a deal as the path where this happens is seldom hit because it
> requires an interrupt to happen between a few lines of code that also
> writes an event into the same buffer.
>
> With this change, the 32-bit rb_time_t workaround can be removed.
>
Hmm, but this patch itself is just moving rb_time_cmpxchg() in the new
rb_time_cmp_and_update() function. The actual change has been done
in the next patch. I think there is no reason to split this from the
second one...
Isn't this part actual change?
> static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> {
> - return rb_time_cmpxchg(t, expect, set);
> +#ifdef RB_TIME_32
> + return expect == READ_ONCE(t->time);
> +#else
> + return local64_try_cmpxchg(&t->time, &expect, set);
> +#endif
> }
Thank you,
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/trace/ring_buffer.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 1a26b3a1901f..c6842a4331a9 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -762,6 +762,15 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> }
> #endif
>
> +/*
> + * Returns true if t == expect, and if possible will update with set,
> + * but t is not guaranteed to be updated even if this returns true
> + */
> +static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> +{
> + return rb_time_cmpxchg(t, expect, set);
> +}
> +
> /*
> * Enable this to make sure that the event passed to
> * ring_buffer_event_time_stamp() is not committed and also
> @@ -3622,9 +3631,17 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> barrier();
> /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> info->after < ts &&
> - rb_time_cmpxchg(&cpu_buffer->write_stamp,
> - info->after, ts)) {
> - /* Nothing came after this event between C and E */
> + rb_time_cmp_and_update(&cpu_buffer->write_stamp,
> + info->after, ts)) {
> + /*
> + * Nothing came after this event between C and E it is
> + * safe to use info->after for the delta.
> + * The above rb_time_cmp_and_update() may or may not
> + * have updated the write_stamp. If it did not then
> + * the next event will simply add an absolute timestamp
> + * as the write_stamp will be different than the
> + * before_stamp.
> + */
> info->delta = ts - info->after;
> } else {
> /*
> --
> 2.42.0
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
2023-12-18 14:24 ` Masami Hiramatsu
@ 2023-12-18 15:15 ` Steven Rostedt
2023-12-18 18:42 ` Steven Rostedt
0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2023-12-18 15:15 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linus Torvalds
On Mon, 18 Dec 2023 23:24:55 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> On Fri, 15 Dec 2023 11:55:13 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> >
> > There's only one place that performs a 64-bit cmpxchg for the timestamp
> > processing. The cmpxchg is only to set the write_stamp equal to the
> > before_stamp, and if it doesn't get set, then the next event will simply
> > be forced to add an absolute timestamp.
> >
> > Given that 64-bit cmpxchg is expensive on 32-bit, and the current
> > workaround uses 3 consecutive 32-bit cmpxchg doesn't make it any faster.
> > It's best to just not do the cmpxchg as a simple compare works for the
> > accuracy of the timestamp. The only thing that will happen without the
> > cmpxchg is the prepended absolute timestamp on the next event which is not
> > that big of a deal as the path where this happens is seldom hit because it
> > requires an interrupt to happen between a few lines of code that also
> > writes an event into the same buffer.
> >
> > With this change, the 32-bit rb_time_t workaround can be removed.
> >
>
> Hmm, but this patch itself is just moving rb_time_cmpxchg() in the new
> rb_time_cmp_and_update() function. The actual change has been done
> in the next patch.
Exactly. Which is why I said above "with this change, the 32-bit rb_time_t
workaround can be removed". It can't be removed without this change.
> I think there is no reason to split this from the
> second one...
I originally had it as one patch, but I disliked the removal of the
workaround touching the main logic code (which this patch does).
Basically I broke it into:
1. Remove workaround exposure from the main logic. (this patch)
2. Remove the workaround. (next patch).
>
> Isn't this part actual change?
This part is abstracted out from the main logic. Which is why I made this
patch.
>
> > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> > {
> > - return rb_time_cmpxchg(t, expect, set);
> > +#ifdef RB_TIME_32
> > + return expect == READ_ONCE(t->time);
And I need to make a v2 as the above is wrong. It should have been:
return expect == local64_read(&t->time);
-- Steve
> > +#else
> > + return local64_try_cmpxchg(&t->time, &expect, set);
> > +#endif
> > }
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
2023-12-18 15:15 ` Steven Rostedt
@ 2023-12-18 18:42 ` Steven Rostedt
2023-12-18 22:52 ` Steven Rostedt
0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2023-12-18 18:42 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linus Torvalds
On Mon, 18 Dec 2023 10:15:31 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> Basically I broke it into:
>
> 1. Remove workaround exposure from the main logic. (this patch)
> 2. Remove the workaround. (next patch).
>
> >
> > Isn't this part actual change?
>
> This part is abstracted out from the main logic. Which is why I made this
> patch.
>
> >
> > > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> > > {
> > > - return rb_time_cmpxchg(t, expect, set);
> > > +#ifdef RB_TIME_32
> > > + return expect == READ_ONCE(t->time);
>
> And I need to make a v2 as the above is wrong. It should have been:
>
> return expect == local64_read(&t->time);
My v2 version will also make 64 bit not guaranteed to update on return of
true. Which adds even more reason to separate out the two.
-- Steve
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
2023-12-18 18:42 ` Steven Rostedt
@ 2023-12-18 22:52 ` Steven Rostedt
0 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2023-12-18 22:52 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linus Torvalds
On Mon, 18 Dec 2023 13:42:40 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
> > > > {
> > > > - return rb_time_cmpxchg(t, expect, set);
> > > > +#ifdef RB_TIME_32
> > > > + return expect == READ_ONCE(t->time);
> >
> > And I need to make a v2 as the above is wrong. It should have been:
> >
> > return expect == local64_read(&t->time);
>
>
> My v2 version will also make 64 bit not guaranteed to update on return of
> true. Which adds even more reason to separate out the two.
This code was failing my tests, and after figuring out why, I realized I
can remove this 64-bit cmpxchg for both 64-bit and 32-bit architectures!
I was thinking that the 64-bit cmpxchg() was to keep from adding an
absolute timestamp to after the slowpath. But that was not the case. It was
actually going to *add* a absolute timestamp if it succeeded. Well, not in
every case, but in some cases.
First let me explain the purpose of this last 64-bit cmpxchg:
Before reserving the data on the ring buffer, we do:
/*A*/ w = local_read(&tail_page->write) & RB_WRITE_MASK;
barrier();
b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
barrier();
info->ts = rb_time_stamp(cpu_buffer->buffer);
The 'w' is the location we expect to have our data on.
Then we read the two timestamps "before_stamp" and "write_stamp" and save
them locally on the stack (info is a stack item).
Then we read the current timestamp and place it into info->ts.
/*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts);
All events written into the ring buffer will write into the "before_stamp"
before reserving the ring buffer. I cut out the logic above that makes sure
that "before_stamp" matches "write_stamp", as if they do not match, the
"write_stamp" can not be trusted. As soon as the "before_stamp" is written
to, then any interrupting events will not trust the "write_stamp" and add
its own absolute timestamp, but also update the write_stamp to its
before_stamp so that later events can be trusted.
Now reserve the data for this event on the ring buffer:
/*C*/ write = local_add_return(info->length, &tail_page->write);
tail = write - info->length;
if (likely(tail == w)) {
The above reserves the data for the event on the ring buffer, and then
checks to see if where it was reserved matches where it expected to be
reserved at the start of the operations above.
[...]
} else {
Now we go into the slow path of where some thing snucked in between /*A*/
and /*C*/
a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
Read the write_stamp again.
ts = rb_time_stamp(cpu_buffer->buffer);
Get the current timestamp
barrier();
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
If "write" from /*C*/ still matches the current location on the ring
buffer, we know that nothing came in between /*C*/ and /*E*/
rb_time_cmpxchg(&cpu_buffer->write_stamp,
info->after, ts)) {
info->delta = ts - info->after;
Now the above cmpxchg() is needed in this code because we need to update
the write_stamp to the current timestamp so that we can safely calculate the
delta.
But by writing to the write_stamp via the cmpxchg() we are actually making
it different than the before_stamp, as the interruption could have been
between /*B*/ and /*C*/, and this would also force the next event to use an
absolute timestamp. Of course, if it happened between /*A*/ and /*B*/ it is
actually validating the write_stamp again by making it match the
before_stamp.
The real reason we need to do the cmpxchg() is because of a possible
interruption after /*E*/ above. The write_stamp needs to be updated
atomically, otherwise the interrupting event that comes after /*E*/, will
be using the previous interruption write_stamp to calculate its delta from.
For example:
/*B*/ rb_time_set(&cpu_buffer->before_stamp, info->ts);
---> interrupt, updates before_stamp and write_stamp to its own ts
/*C*/ write = local_add_return(info->length, &tail_page->write);
/* Allocated memory, any interruptions here, will be using the previous
interrupt ts and not this event */
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
If write == tail_page->write than nothing came in, so it is still safe to
add a delta, but only if we atomically update the write_stamp, forcing
interrupting events to add an absolute timestamp. If it is not done
atomically, then there's a race that another interrupt could come in, and
use the previous interrupt event to calculate its delta, even though, this
event is in between the two.
BUT! we can also do this instead without the cmpcxchg! Especially as
success of cmpxchg could possibly force a absolute timestamp. If we always
force the absolute timestamp, then there's no need for the cmpxchg at all.
By doing:
/*
* Read a new timestamp and update before_stamp with it.
* Any new event coming in now will use an absolute timestamp
*/
ts = rb_time_stamp(cpu_buffer->buffer);
rb_time_set(&cpu_buffer->before_stamp, ts);
/* Next event will force an absolute timestamp */
barrier();
a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
barrier();
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts) {
/*
* If nothing came in between C and E, it is safe
* to use the write_stamp as the delta.
*/
info->delta = ts - info->after;
} else {
/*
* Interrupted twice and the second interruption is possibly
* using the first interruption to calculate its delta. Just
* set our delta to zero to not mess the event that came in
* after up.
*/
info->delta = 0;
}
With this logic, we do not need any 64-bit cmpxchg() at all!
-- Steve
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-12-18 22:51 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-15 16:55 [PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available Steven Rostedt
2023-12-15 16:55 ` [PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update() Steven Rostedt
2023-12-18 14:24 ` Masami Hiramatsu
2023-12-18 15:15 ` Steven Rostedt
2023-12-18 18:42 ` Steven Rostedt
2023-12-18 22:52 ` Steven Rostedt
2023-12-15 16:55 ` [PATCH 2/2] ring-buffer: Remove 32bit timestamp logic Steven Rostedt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox