* [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
@ 2025-05-13  0:50 Masami Hiramatsu (Google)
  2025-05-14  0:32 ` Steven Rostedt
From: Masami Hiramatsu (Google) @ 2025-05-13  0:50 UTC
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Rewind persistent ring buffer pages which have been read in the
previous boot. Those pages are highly likely to be lost before
being written to disk if the previous kernel crashed. In that
case, the trace data is still kept in the persistent ring buffer,
but it cannot be read because its commit size was reset when it
was read.
This change skips clearing the commit size of each sub-buffer and
recovers it after reboot.

Note: if you read the previous boot data via trace_pipe, it
becomes inaccessible at that point. But if you reboot without
clearing (or reusing) the buffer, the read data is recovered
again on the next boot.
Thus, once you have read the previous boot data, clear it with
`echo > trace`.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/ring_buffer.c |   55 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1ca482955dae..a1014c8b8772 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1358,6 +1358,13 @@ static inline void rb_inc_page(struct buffer_page **bpage)
 	*bpage = list_entry(p, struct buffer_page, list);
 }
 
+static inline void rb_dec_page(struct buffer_page **bpage)
+{
+	struct list_head *p = rb_list_head((*bpage)->list.prev);
+
+	*bpage = list_entry(p, struct buffer_page, list);
+}
+
 static struct buffer_page *
 rb_set_head_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
@@ -1866,9 +1873,10 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
 static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
-	struct buffer_page *head_page;
+	struct buffer_page *head_page, *orig_head;
 	unsigned long entry_bytes = 0;
 	unsigned long entries = 0;
+	bool zero_commit;
 	int ret;
 	int i;
 
@@ -1885,7 +1893,49 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
 	local_set(&cpu_buffer->reader_page->entries, ret);
 
-	head_page = cpu_buffer->head_page;
+	orig_head = head_page = cpu_buffer->head_page;
+
+	/*
+	 * Try to rewind the head so that we can read the pages which were
+	 * already read in the previous boot.
+	 */
+	zero_commit = false;
+	rb_dec_page(&head_page);
+	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_dec_page(&head_page)) {
+
+		/* Rewind until tail (writer) page. */
+		if (head_page == cpu_buffer->tail_page ||
+		    local_read(&head_page->write) /* under writing */)
+			break;
+
+		/* Stop rewind if the page is invalid. */
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		if (ret < 0)
+			break;
+
+		/* Recover the number of entries. */
+		local_set(&head_page->entries, ret);
+		if (ret)
+			local_inc(&cpu_buffer->pages_touched);
+		entries += ret;
+		entry_bytes += rb_page_commit(head_page);
+		/* If there are 2 zero commit pages, stop rewind. */
+		if (rb_page_commit(head_page) == 0) {
+			if (zero_commit)
+				break;
+			zero_commit = true;
+		} else
+			zero_commit = false;
+	}
+	/* The last rewind page must be skipped. */
+	rb_inc_page(&head_page);
+
+	if (head_page != orig_head) {
+		cpu_buffer->head_page = head_page;
+		head_page = orig_head;
+		if (zero_commit && rb_page_commit(head_page) == 0)
+			rb_inc_page(&head_page);
+	}
 
 	/* If both the head and commit are on the reader_page then we are done. */
 	if (head_page == cpu_buffer->reader_page &&
@@ -5346,7 +5396,6 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 */
 	local_set(&cpu_buffer->reader_page->write, 0);
 	local_set(&cpu_buffer->reader_page->entries, 0);
-	local_set(&cpu_buffer->reader_page->page->commit, 0);
 	cpu_buffer->reader_page->real_end = 0;
 
  spin:



* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-13  0:50 [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot Masami Hiramatsu (Google)
@ 2025-05-14  0:32 ` Steven Rostedt
  2025-05-14  0:36   ` Steven Rostedt
  2025-05-14  6:00   ` Masami Hiramatsu
From: Steven Rostedt @ 2025-05-14  0:32 UTC
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Tue, 13 May 2025 09:50:27 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Rewind persistent ring buffer pages which have been read in the
> previous boot. Those pages are highly likely to be lost before
> being written to disk if the previous kernel crashed. In that
> case, the trace data is still kept in the persistent ring buffer,
> but it cannot be read because its commit size was reset when it
> was read.
> This change skips clearing the commit size of each sub-buffer and
> recovers it after reboot.
> 
> Note: if you read the previous boot data via trace_pipe, it
> becomes inaccessible at that point. But if you reboot without
> clearing (or reusing) the buffer, the read data is recovered
> again on the next boot.
> Thus, once you have read the previous boot data, clear it with
> `echo > trace`.

Is that a problem? I'm thinking that the data in the buffer should not be
used. Anyway, I had to fix a bug with trace_pipe:

  https://lore.kernel.org/linux-trace-kernel/20250513115032.3e0b97f7@gandalf.local.home/

And with that applied, it doesn't rewind all the way, but still gives you
what's left on the reader page. I think that is the case without this patch
too.

> 
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  kernel/trace/ring_buffer.c |   55 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 52 insertions(+), 3 deletions(-)

Anyway, I found that this is a lot more complex than what you have here. As
the commit size is set to zero when a page is added back into the ring
buffer, I needed to remove that. But things get much trickier than that!

First, we need to make the newly found head page (the one that was already
read) the new "reader page". And we need to insert the old reader page
back into the ring buffer. And reset everything to handle all of this.

I played a little with it, and came up with this. It's still a bit buggy,
but it does give you an idea of what needs to be done.

-- Steve

This is on top of the trace_pipe fix in the link above.

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6859008ca34d..f5dd96ca25c9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1358,6 +1358,13 @@ static inline void rb_inc_page(struct buffer_page **bpage)
 	*bpage = list_entry(p, struct buffer_page, list);
 }
 
+static inline void rb_dec_page(struct buffer_page **bpage)
+{
+	struct list_head *p = rb_list_head((*bpage)->list.prev);
+
+	*bpage = list_entry(p, struct buffer_page, list);
+}
+
 static struct buffer_page *
 rb_set_head_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
@@ -1862,13 +1869,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
 	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
 }
 
+static void rb_update_meta_reader(struct ring_buffer_per_cpu *cpu_buffer,
+				  struct buffer_page *reader);
+
 /* If the meta data has been validated, now validate the events */
 static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
-	struct buffer_page *head_page;
+	struct buffer_page *head_page, *orig_head;
 	unsigned long entry_bytes = 0;
 	unsigned long entries = 0;
+	u64 ts;
 	int ret;
 	int i;
 
@@ -1885,8 +1896,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
 	local_set(&cpu_buffer->reader_page->entries, ret);
 
-	head_page = cpu_buffer->head_page;
-
 	/* If the commit_buffer is the reader page, update the commit page */
 	if (meta->commit_buffer == (unsigned long)cpu_buffer->reader_page->page) {
 		cpu_buffer->commit_page = cpu_buffer->reader_page;
@@ -1894,6 +1903,88 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 		goto done;
 	}
 
+	orig_head = head_page = cpu_buffer->head_page;
+	ts = cpu_buffer->reader_page->page->time_stamp;
+
+	/*
+	 * Try to rewind the head so that we can read the pages which were
+	 * already read in the previous boot.
+	 */
+	if (head_page != cpu_buffer->tail_page)
+		rb_dec_page(&head_page);
+
+	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_dec_page(&head_page)) {
+
+		/* Rewind until tail (writer) page. */
+		if (head_page == cpu_buffer->tail_page)
+			break;
+
+		if (ts < head_page->page->time_stamp)
+			break;
+
+		ts = head_page->page->time_stamp;
+		if (!ts)
+			break;
+
+		if (rb_page_commit(head_page) == 0)
+			break;
+
+		/* Stop rewind if the page is invalid. */
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		if (ret < 0)
+			break;
+
+		/* Recover the number of entries. */
+		local_set(&head_page->entries, ret);
+		if (ret)
+			local_inc(&cpu_buffer->pages_touched);
+		entries += ret;
+		entry_bytes += rb_page_commit(head_page);
+	}
+
+	/* The last rewind page must be skipped. */
+	if (head_page != orig_head)
+		rb_inc_page(&head_page);
+
+	if (head_page != orig_head) {
+		struct buffer_page *bpage = orig_head;
+
+		rb_dec_page(&bpage);
+		/*
+		 * Move the reader page between the orig_head and the page
+		 * before it.
+		 */
+		cpu_buffer->reader_page->list.next = &orig_head->list;
+		cpu_buffer->reader_page->list.prev = orig_head->list.prev;
+		orig_head->list.prev = &cpu_buffer->reader_page->list;
+
+		bpage->list.next = &cpu_buffer->reader_page->list;
+
+		/* Make the head_page the new reader page */
+		cpu_buffer->reader_page = head_page;
+		bpage = head_page;
+		rb_inc_page(&head_page);
+		head_page->list.prev = bpage->list.prev;
+		rb_dec_page(&bpage);
+		bpage->list.next = &head_page->list;
+		rb_set_list_to_head(&bpage->list);
+
+		cpu_buffer->head_page = head_page;
+		meta->head_buffer = (unsigned long)head_page->page;
+
+		/* Reset all the indexes */
+		bpage = cpu_buffer->reader_page;
+		meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
+		bpage->id = 0;
+
+		for (i = 0, bpage = head_page; i < meta->nr_subbufs;
+		     i++, rb_inc_page(&bpage)) {
+			meta->buffers[i + 1] = rb_meta_subbuf_idx(meta, bpage->page);
+			bpage->id = i + 1;
+		}
+		head_page = orig_head;
+	}
+
 	/* Iterate until finding the commit page */
 	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
 
@@ -5348,7 +5439,6 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	 */
 	local_set(&cpu_buffer->reader_page->write, 0);
 	local_set(&cpu_buffer->reader_page->entries, 0);
-	local_set(&cpu_buffer->reader_page->page->commit, 0);
 	cpu_buffer->reader_page->real_end = 0;
 
  spin:
@@ -6642,7 +6732,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 		cpu_buffer->read_bytes += rb_page_size(reader);
 
 		/* swap the pages */
-		rb_init_page(bpage);
+//		rb_init_page(bpage);
 		bpage = reader->page;
 		reader->page = data_page->data;
 		local_set(&reader->write, 0);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cf51c30b137f..2a060c62d686 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6066,6 +6066,7 @@ struct trace_mod_entry {
 };
 
 struct trace_scratch {
+	int			clock_id;
 	unsigned long		text_addr;
 	unsigned long		nr_entries;
 	struct trace_mod_entry	entries[];
@@ -6181,6 +6182,7 @@ static void update_last_data(struct trace_array *tr)
 	if (tr->scratch) {
 		struct trace_scratch *tscratch = tr->scratch;
 
+		tscratch->clock_id = tr->clock_id;
 		memset(tscratch->entries, 0,
 		       flex_array_size(tscratch, entries, tscratch->nr_entries));
 		tscratch->nr_entries = 0;
@@ -7403,6 +7405,12 @@ int tracing_set_clock(struct trace_array *tr, const char *clockstr)
 	tracing_reset_online_cpus(&tr->max_buffer);
 #endif
 
+	if (tr->scratch && !(tr->flags & TRACE_ARRAY_FL_LAST_BOOT)) {
+		struct trace_scratch *tscratch = tr->scratch;
+
+		tscratch->clock_id = i;
+	}
+
 	mutex_unlock(&trace_types_lock);
 
 	return 0;
@@ -9628,6 +9636,14 @@ static void setup_trace_scratch(struct trace_array *tr,
 
 	/* Scan modules to make text delta for modules. */
 	module_for_each_mod(make_mod_delta, tr);
+
+	/* Set trace_clock as the same of the previous boot. */
+	if (tscratch->clock_id != tr->clock_id) {
+		if (tracing_set_clock(tr, trace_clocks[tscratch->clock_id].name) < 0) {
+			pr_info("the previous trace_clock info is not valid.");
+			goto reset;
+		}
+	}
 	return;
  reset:
 	/* Invalid trace modules */


* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-14  0:32 ` Steven Rostedt
@ 2025-05-14  0:36   ` Steven Rostedt
  2025-05-14  4:33     ` Masami Hiramatsu
  2025-05-14  6:00   ` Masami Hiramatsu
From: Steven Rostedt @ 2025-05-14  0:36 UTC
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Tue, 13 May 2025 20:32:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -9628,6 +9636,14 @@ static void setup_trace_scratch(struct trace_array *tr,
>  
>  	/* Scan modules to make text delta for modules. */
>  	module_for_each_mod(make_mod_delta, tr);
> +
> +	/* Set trace_clock as the same of the previous boot. */
> +	if (tscratch->clock_id != tr->clock_id) {
> +		if (tracing_set_clock(tr, trace_clocks[tscratch->clock_id].name) < 0) {
> +			pr_info("the previous trace_clock info is not valid.");
> +			goto reset;
> +		}
> +	}
>  	return;
>   reset:
>  	/* Invalid trace modules */


Hmm, somehow I got your other patch in here :-D

-- Steve


* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-14  0:36   ` Steven Rostedt
@ 2025-05-14  4:33     ` Masami Hiramatsu
From: Masami Hiramatsu @ 2025-05-14  4:33 UTC
  To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Tue, 13 May 2025 20:36:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 13 May 2025 20:32:37 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > @@ -9628,6 +9636,14 @@ static void setup_trace_scratch(struct trace_array *tr,
> >  
> >  	/* Scan modules to make text delta for modules. */
> >  	module_for_each_mod(make_mod_delta, tr);
> > +
> > +	/* Set trace_clock as the same of the previous boot. */
> > +	if (tscratch->clock_id != tr->clock_id) {
> > +		if (tracing_set_clock(tr, trace_clocks[tscratch->clock_id].name) < 0) {
> > +			pr_info("the previous trace_clock info is not valid.");
> > +			goto reset;
> > +		}
> > +	}
> >  	return;
> >   reset:
> >  	/* Invalid trace modules */
> 
> 
> Hmm, somehow I got your other patch in here :-D

Oops, good catch! I mixed another patch in there :-(


> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-14  0:32 ` Steven Rostedt
  2025-05-14  0:36   ` Steven Rostedt
@ 2025-05-14  6:00   ` Masami Hiramatsu
  2025-05-14 13:00     ` Steven Rostedt
From: Masami Hiramatsu @ 2025-05-14  6:00 UTC
  To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Tue, 13 May 2025 20:32:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 13 May 2025 09:50:27 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Rewind persistent ring buffer pages which have been read in the
> > previous boot. Those pages are highly likely to be lost before
> > being written to disk if the previous kernel crashed. In that
> > case, the trace data is still kept in the persistent ring buffer,
> > but it cannot be read because its commit size was reset when it
> > was read.
> > This change skips clearing the commit size of each sub-buffer and
> > recovers it after reboot.
> > 
> > Note: if you read the previous boot data via trace_pipe, it
> > becomes inaccessible at that point. But if you reboot without
> > clearing (or reusing) the buffer, the read data is recovered
> > again on the next boot.
> > Thus, once you have read the previous boot data, clear it with
> > `echo > trace`.
> 
> Is that a problem? I'm thinking that the data in the buffer should not be
> used.

Yes. Even if we read (dump) the previous boot data, the data remains
in the buffer. Thus, if the kernel reboots before the buffer is reused,
the dumped pages are recovered again. Unless we compare them with the
previously dumped data, we cannot tell whether they are from an older
boot. Anyway, the user can avoid this issue by clearing the trace
buffer explicitly.


> Anyway, I had to fix a bug with trace_pipe:
> 
>   https://lore.kernel.org/linux-trace-kernel/20250513115032.3e0b97f7@gandalf.local.home/
> 
> And with that applied, it doesn't rewind all the way, but still gives you
> what's left on the reader page. I think that is the case without this patch
> too.

OK.

> 
> > 
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  kernel/trace/ring_buffer.c |   55 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 52 insertions(+), 3 deletions(-)
> 
> Anyway, I found that this is a lot more complex than what you have here. As
> the commit size is set to zero when added back into the ring buffer, I
> needed to remove that. But things get much more trickier than that!
> 
> First, we need to make the new head page found (the one that was read
> already) the new "reader page". And we need to insert the old reader page
> back into the ring buffer. And reset everything to handle all of this.

Yeah, I noticed the timestamps were messed up with my patch. That would
be sorted out by external tools like Perfetto, but it is still not good
for tracefs users.

> 
> I played a little with it, and came up with this. It's still a bit buggy,
> but it does give you an idea of what needs to be done.

Thanks! I will work on this version.

> 
> -- Steve
> 
> This is on top of the trace_pipe fix in the link above.
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 6859008ca34d..f5dd96ca25c9 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1358,6 +1358,13 @@ static inline void rb_inc_page(struct buffer_page **bpage)
>  	*bpage = list_entry(p, struct buffer_page, list);
>  }
>  
> +static inline void rb_dec_page(struct buffer_page **bpage)
> +{
> +	struct list_head *p = rb_list_head((*bpage)->list.prev);
> +
> +	*bpage = list_entry(p, struct buffer_page, list);
> +}
> +
>  static struct buffer_page *
>  rb_set_head_page(struct ring_buffer_per_cpu *cpu_buffer)
>  {
> @@ -1862,13 +1869,17 @@ static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
>  	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
>  }
>  
> +static void rb_update_meta_reader(struct ring_buffer_per_cpu *cpu_buffer,
> +				  struct buffer_page *reader);
> +
>  /* If the meta data has been validated, now validate the events */
>  static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
>  {
>  	struct ring_buffer_cpu_meta *meta = cpu_buffer->ring_meta;
> -	struct buffer_page *head_page;
> +	struct buffer_page *head_page, *orig_head;
>  	unsigned long entry_bytes = 0;
>  	unsigned long entries = 0;
> +	u64 ts;
>  	int ret;
>  	int i;
>  
> @@ -1885,8 +1896,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
>  	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
>  	local_set(&cpu_buffer->reader_page->entries, ret);
>  
> -	head_page = cpu_buffer->head_page;
> -
>  	/* If the commit_buffer is the reader page, update the commit page */
>  	if (meta->commit_buffer == (unsigned long)cpu_buffer->reader_page->page) {
>  		cpu_buffer->commit_page = cpu_buffer->reader_page;
> @@ -1894,6 +1903,88 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
>  		goto done;
>  	}
>  
> +	orig_head = head_page = cpu_buffer->head_page;
> +	ts = cpu_buffer->reader_page->page->time_stamp;

Nice catch! Using the reader page's timestamp for ts is better here.

> +
> +	/*
> +	 * Try to rewind the head so that we can read the pages which were
> +	 * already read in the previous boot.
> +	 */
> +	if (head_page != cpu_buffer->tail_page)
> +		rb_dec_page(&head_page);
> +
> +	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_dec_page(&head_page)) {
> +
> +		/* Rewind until tail (writer) page. */
> +		if (head_page == cpu_buffer->tail_page)
> +			break;
> +
> +		if (ts < head_page->page->time_stamp)
> +			break;
> +
> +		ts = head_page->page->time_stamp;
> +		if (!ts)
> +			break;
> +
> +		if (rb_page_commit(head_page) == 0)
> +			break;

Ah, OK, a single zero-size page can be the end.

> +
> +		/* Stop rewind if the page is invalid. */
> +		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
> +		if (ret < 0)
> +			break;
> +
> +		/* Recover the number of entries. */
> +		local_set(&head_page->entries, ret);
> +		if (ret)
> +			local_inc(&cpu_buffer->pages_touched);
> +		entries += ret;
> +		entry_bytes += rb_page_commit(head_page);

If we validate the pages again later (because of fixing head_page),
we can skip this part.

> +	}
> +
> +	/* The last rewind page must be skipped. */
> +	if (head_page != orig_head)
> +		rb_inc_page(&head_page);
> +
> +	if (head_page != orig_head) {

Ah, I forgot this part (setting up the new reader_page).

> +		struct buffer_page *bpage = orig_head;
> +
> +		rb_dec_page(&bpage);
> +		/*
> +		 * Move the reader page between the orig_head and the page
> +		 * before it.
> +		 */
-----
> +		cpu_buffer->reader_page->list.next = &orig_head->list;
> +		cpu_buffer->reader_page->list.prev = orig_head->list.prev;
> +		orig_head->list.prev = &cpu_buffer->reader_page->list;
> +
> +		bpage->list.next = &cpu_buffer->reader_page->list;
-----
These seem to be the same as the following (because head_page->list.prev->next
encodes flags, but we don't read that pointer):

		list_insert(&orig_head->list, &cpu_buffer->reader_page->list);
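
(list_insert() is not in the kernel proper, as noted later in the thread,
so a plausible definition for an ordinary, flag-free list, purely as a
sketch, would be:)

	/* Hypothetical: insert @new before @at in a plain list. */
	static inline void list_insert(struct list_head *at,
				       struct list_head *new)
	{
		new->next = at;
		new->prev = at->prev;
		at->prev->next = new;	/* on ring-buffer lists this store
					 * would also clear flag bits kept
					 * in ->next */
		at->prev = new;
	}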

> +
> +		/* Make the head_page the new reader page */
> +		cpu_buffer->reader_page = head_page;
> +		bpage = head_page;
> +		rb_inc_page(&head_page);
> +		head_page->list.prev = bpage->list.prev;
> +		rb_dec_page(&bpage);
> +		bpage->list.next = &head_page->list;
> +		rb_set_list_to_head(&bpage->list);
> +
> +		cpu_buffer->head_page = head_page;
> +		meta->head_buffer = (unsigned long)head_page->page;
> +
> +		/* Reset all the indexes */
> +		bpage = cpu_buffer->reader_page;
> +		meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
> +		bpage->id = 0;
> +
> +		for (i = 0, bpage = head_page; i < meta->nr_subbufs;
> +		     i++, rb_inc_page(&bpage)) {
> +			meta->buffers[i + 1] = rb_meta_subbuf_idx(meta, bpage->page);
> +			bpage->id = i + 1;
> +		}
> +		head_page = orig_head;
> +	}
> +
>  	/* Iterate until finding the commit page */
>  	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
>  
> @@ -5348,7 +5439,6 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 */
>  	local_set(&cpu_buffer->reader_page->write, 0);
>  	local_set(&cpu_buffer->reader_page->entries, 0);
> -	local_set(&cpu_buffer->reader_page->page->commit, 0);
>  	cpu_buffer->reader_page->real_end = 0;
>  
>   spin:
> @@ -6642,7 +6732,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		cpu_buffer->read_bytes += rb_page_size(reader);
>  
>  		/* swap the pages */
> -		rb_init_page(bpage);
> +//		rb_init_page(bpage);
>  		bpage = reader->page;
>  		reader->page = data_page->data;
>  		local_set(&reader->write, 0);

Thank you,



-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-14  6:00   ` Masami Hiramatsu
@ 2025-05-14 13:00     ` Steven Rostedt
  2025-05-15  0:42       ` Masami Hiramatsu
From: Steven Rostedt @ 2025-05-14 13:00 UTC
  To: Masami Hiramatsu (Google)
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Wed, 14 May 2025 15:00:59 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > 
> > Is that a problem? I'm thinking that the data in the buffer should not be
> > used.  
> 
> Yes. Even if we read (dump) the previous boot data, the data remains
> in the buffer. Thus, if the kernel reboots before the buffer is reused,
> the dumped pages are recovered again. Unless we compare them with the
> previously dumped data, we cannot tell whether they are from an older
> boot. Anyway, the user can avoid this issue by clearing the trace
> buffer explicitly.

What we could do, and I don't think this would be too hard, is once the
buffer is empty and it's still LAST_BOOT buffer, we simply clear it in
the kernel.

That way after a reboot, a read of trace_pipe that reads the entire
buffer will end up resetting the buffer, and I think that will solve
this problem.
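
A minimal sketch of that idea (hypothetical: the helper name and where it
would be hooked, e.g. at the end of a trace_pipe read, are assumptions,
not code from this thread):

	/* Hypothetical hook, called once a read has drained the buffer. */
	static void maybe_reset_last_boot(struct trace_array *tr)
	{
		/* Only act on a buffer still holding last-boot data. */
		if (!(tr->flags & TRACE_ARRAY_FL_LAST_BOOT))
			return;

		/* Once fully consumed, clear the buffer so a later reboot
		 * cannot resurrect the already-dumped pages. */
		if (ring_buffer_empty(tr->array_buffer.buffer))
			tracing_reset_online_cpus(&tr->array_buffer);
	}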



> > +
> > +		/* Stop rewind if the page is invalid. */
> > +		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
> > +		if (ret < 0)
> > +			break;
> > +
> > +		/* Recover the number of entries. */
> > +		local_set(&head_page->entries, ret);
> > +		if (ret)
> > +			local_inc(&cpu_buffer->pages_touched);
> > +		entries += ret;
> > +		entry_bytes += rb_page_commit(head_page);  
> 
> If we validate the pages again later (because fixing head_page),
> we can skip this part.

The validator takes a bit of time. I would rather not do another loop
if we don't have to. If this is duplicate code, let's just make a static
inline helper function that does it and use that in both places.
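
A sketch of such a helper (the name and signature are hypothetical; the
body just factors out the validation step from the loop above):

	/* Validate one sub-buffer and fold its counts into the running
	 * totals; returns the number of entries, or < 0 if invalid. */
	static inline int rb_meta_validate_page(struct ring_buffer_per_cpu *cpu_buffer,
						struct buffer_page *bpage,
						unsigned long *entries,
						unsigned long *entry_bytes)
	{
		int ret;

		ret = rb_validate_buffer(bpage->page, cpu_buffer->cpu);
		if (ret < 0)
			return ret;

		/* Recover the number of entries on this sub-buffer. */
		local_set(&bpage->entries, ret);
		if (ret)
			local_inc(&cpu_buffer->pages_touched);
		*entries += ret;
		*entry_bytes += rb_page_commit(bpage);

		return ret;
	}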

> 
> > +	}
> > +
> > +	/* The last rewind page must be skipped. */
> > +	if (head_page != orig_head)
> > +		rb_inc_page(&head_page);
> > +
> > +	if (head_page != orig_head) {  
> 
> Ah, I forgot this part (setup new reader_page)
> 
> > +		struct buffer_page *bpage = orig_head;
> > +
> > +		rb_dec_page(&bpage);
> > +		/*
> > +		 * Move the reader page between the orig_head and the page
> > +		 * before it.
> > +		 */  
> -----
> > +		cpu_buffer->reader_page->list.next = &orig_head->list;
> > +		cpu_buffer->reader_page->list.prev = orig_head->list.prev;
> > +		orig_head->list.prev = &cpu_buffer->reader_page->list;
> > +
> > +		bpage->list.next = &cpu_buffer->reader_page->list;  
> -----
> These seems the same as (because head_page->list.prev->next encodes
> flags, but we don't read that pointer.);
> 
> 		list_insert(&orig_head->list, &cpu_buffer->reader_page->list);

I thought about this, but because the pointers are used to encode
flags, I try to avoid using the list_*() functions altogether on
these. Just to remind everyone that these are "special" lists.

I prefer it open coded because that way I can see exactly what it is
doing. Note, this is not just assigning pointers, it is also clearing
flags in the process.

We could add a comment that states something like:

	/*
	 * This is the same as:
	 *   list_insert(&orig_head->list, &cpu_buffer->reader_page->list);
	 * but as it is also clearing flags, it's open coded so that
	 * there's no chance that list_insert() gets optimized where
	 * it doesn't do the extra work that this is doing.
	 */

?
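
To spell out what those flags look like for readers following along
(paraphrasing the definitions in kernel/trace/ring_buffer.c; treat the
exact values as illustrative):

	/* The low bits of a ->next pointer encode page state: */
	#define RB_PAGE_NORMAL	0UL
	#define RB_PAGE_HEAD	1UL	/* the page it points to is the head */
	#define RB_PAGE_UPDATE	2UL	/* the head pointer is being updated */
	#define RB_FLAG_MASK	3UL

	/* Any read of these list pointers must mask the flags first: */
	static struct list_head *rb_list_head(struct list_head *list)
	{
		unsigned long val = (unsigned long)list;

		return (struct list_head *)(val & ~RB_FLAG_MASK);
	}

A plain store such as "bpage->list.next = &head_page->list" writes a bare
pointer, which is exactly the flag clearing described above.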

-- Steve


> 
> > +
> > +		/* Make the head_page the new reader page */
> > +		cpu_buffer->reader_page = head_page;
> > +		bpage = head_page;
> > +		rb_inc_page(&head_page);
> > +		head_page->list.prev = bpage->list.prev;
> > +		rb_dec_page(&bpage);
> > +		bpage->list.next = &head_page->list;
> > +		rb_set_list_to_head(&bpage->list);
> > +
> > +		cpu_buffer->head_page = head_page;
> > +		meta->head_buffer = (unsigned long)head_page->page;
> > +
> > +		/* Reset all the indexes */
> > +		bpage = cpu_buffer->reader_page;
> > +		meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
> > +		bpage->id = 0;
> > +
> > +		for (i = 0, bpage = head_page; i < meta->nr_subbufs;
> > +		     i++, rb_inc_page(&bpage)) {
> > +			meta->buffers[i + 1] = rb_meta_subbuf_idx(meta, bpage->page);
> > +			bpage->id = i + 1;
> > +		}
> > +		head_page = orig_head;
> > +	}
> > +
> >  	/* Iterate until finding the commit page */
> >  	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
> >  
> > @@ -5348,7 +5439,6 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> >  	 */
> >  	local_set(&cpu_buffer->reader_page->write, 0);
> >  	local_set(&cpu_buffer->reader_page->entries, 0);
> > -	local_set(&cpu_buffer->reader_page->page->commit, 0);
> >  	cpu_buffer->reader_page->real_end = 0;
> >  
> >   spin:
> > @@ -6642,7 +6732,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> >  		cpu_buffer->read_bytes += rb_page_size(reader);
> >  
> >  		/* swap the pages */
> > -		rb_init_page(bpage);
> > +//		rb_init_page(bpage);
> >  		bpage = reader->page;
> >  		reader->page = data_page->data;
> >  		local_set(&reader->write, 0);  
> 
> Thank you,
> 
> 
> 



* Re: [RFC PATCH] tracing: ring_buffer: Rewind persistent ring buffer when reboot
  2025-05-14 13:00     ` Steven Rostedt
@ 2025-05-15  0:42       ` Masami Hiramatsu
From: Masami Hiramatsu @ 2025-05-15  0:42 UTC
  To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel

On Wed, 14 May 2025 09:00:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 14 May 2025 15:00:59 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > > 
> > > Is that a problem? I'm thinking that the data in the buffer should not be
> > > used.  
> > 
> > Yes. Even if we read (dump) the previous boot data, the data remains
> > in the buffer. Thus, if the kernel reboots before the buffer is reused,
> > the dumped pages are recovered again. Unless we compare them with the
> > previously dumped data, we cannot tell whether they are from an older
> > boot. Anyway, the user can avoid this issue by clearing the trace
> > buffer explicitly.
> 
> What we could do, and I don't think this would be too hard, is once the
> buffer is empty and it's still LAST_BOOT buffer, we simply clear it in
> the kernel.

Ah, that sounds good :-D

> 
> That way after a reboot, a read of trace_pipe that reads the entire
> buffer will end up resetting the buffer, and I think that will solve
> this problem.
> 
> 
> 
> > > +
> > > +		/* Stop rewind if the page is invalid. */
> > > +		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
> > > +		if (ret < 0)
> > > +			break;
> > > +
> > > +		/* Recover the number of entries. */
> > > +		local_set(&head_page->entries, ret);
> > > +		if (ret)
> > > +			local_inc(&cpu_buffer->pages_touched);
> > > +		entries += ret;
> > > +		entry_bytes += rb_page_commit(head_page);  
> > 
> > If we validate the pages again later (because fixing head_page),
> > we can skip this part.
> 
> The validator takes a bit of time. I would rather not do another loop
> if we don't have to. If this is duplicate code, let's just make a static
> inline helper function that does it and use that in both places.

OK, I think we can just restart validating the unread part from
orig_head.

> 
> > 
> > > +	}
> > > +
> > > +	/* The last rewind page must be skipped. */
> > > +	if (head_page != orig_head)
> > > +		rb_inc_page(&head_page);
> > > +
> > > +	if (head_page != orig_head) {  
> > 
> > Ah, I forgot this part (setup new reader_page)
> > 
> > > +		struct buffer_page *bpage = orig_head;
> > > +
> > > +		rb_dec_page(&bpage);
> > > +		/*
> > > +		 * Move the reader page between the orig_head and the page
> > > +		 * before it.
> > > +		 */  
> > -----
> > > +		cpu_buffer->reader_page->list.next = &orig_head->list;
> > > +		cpu_buffer->reader_page->list.prev = orig_head->list.prev;
> > > +		orig_head->list.prev = &cpu_buffer->reader_page->list;
> > > +
> > > +		bpage->list.next = &cpu_buffer->reader_page->list;  
> > -----
> > These seem to be the same as the following (because head_page->list.prev->next
> > encodes flags, but we don't read that pointer):
> > 
> > 		list_insert(&orig_head->list, &cpu_buffer->reader_page->list);
> 
> I thought about this, but because the pointers are used to encode
> flags, I try to avoid using the list_*() functions altogether on
> these. Just to remind everyone that these are "special" lists.
> 
> I prefer it open coded because that way I can see exactly what it is
> doing. Note, this is not just assigning pointers, it is also clearing
> flags in the process.

OK. And I found that list_insert() is not in the kernel
(tools/firmware/list.h has it).

> 
> We could add a comment that states something like:
> 
> 	/*
> 	 * This is the same as:
> 	 *   list_insert(&orig_head->list, &cpu_buffer->reader_page->list);
> 	 * but as it is also clearing flags, it's open coded so that
> 	 * there's no chance that list_insert() gets optimized where
> 	 * it doesn't do the extra work that this is doing.
> 	 */
> 
> ?

Yeah, anyway I will leave a comment.

Thank you,

> 
> -- Steve
> 
> 
> > 
> > > +
> > > +		/* Make the head_page the new reader page */
> > > +		cpu_buffer->reader_page = head_page;
> > > +		bpage = head_page;
> > > +		rb_inc_page(&head_page);
> > > +		head_page->list.prev = bpage->list.prev;
> > > +		rb_dec_page(&bpage);
> > > +		bpage->list.next = &head_page->list;
> > > +		rb_set_list_to_head(&bpage->list);
> > > +
> > > +		cpu_buffer->head_page = head_page;
> > > +		meta->head_buffer = (unsigned long)head_page->page;
> > > +
> > > +		/* Reset all the indexes */
> > > +		bpage = cpu_buffer->reader_page;
> > > +		meta->buffers[0] = rb_meta_subbuf_idx(meta, bpage->page);
> > > +		bpage->id = 0;
> > > +
> > > +		for (i = 0, bpage = head_page; i < meta->nr_subbufs;
> > > +		     i++, rb_inc_page(&bpage)) {
> > > +			meta->buffers[i + 1] = rb_meta_subbuf_idx(meta, bpage->page);
> > > +			bpage->id = i + 1;
> > > +		}
> > > +		head_page = orig_head;
> > > +	}
> > > +
> > >  	/* Iterate until finding the commit page */
> > >  	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
> > >  
> > > @@ -5348,7 +5439,6 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> > >  	 */
> > >  	local_set(&cpu_buffer->reader_page->write, 0);
> > >  	local_set(&cpu_buffer->reader_page->entries, 0);
> > > -	local_set(&cpu_buffer->reader_page->page->commit, 0);
> > >  	cpu_buffer->reader_page->real_end = 0;
> > >  
> > >   spin:
> > > @@ -6642,7 +6732,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> > >  		cpu_buffer->read_bytes += rb_page_size(reader);
> > >  
> > >  		/* swap the pages */
> > > -		rb_init_page(bpage);
> > > +//		rb_init_page(bpage);
> > >  		bpage = reader->page;
> > >  		reader->page = data_page->data;
> > >  		local_set(&reader->write, 0);  
> > 
> > Thank you,
> > 
> > 
> > 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

