* [PATCH v5 0/4] tracing: Clean up persistent ring buffer code
From: Steven Rostedt @ 2025-04-01 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Linus Torvalds, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Vincent Donnefort, Vlastimil Babka, Mike Rapoport,
	Jann Horn

Now that I have learned that the memory passed back from reserve_mem is
part of the normal memory allocator and just "reserved", and that it is
already virtually mapped, the code can simply use phys_to_virt() on the
returned physical address to get the virtual mapping for that memory!
  (Thanks Mike!)

That makes things much easier, especially since it means that the memory
returned by reserve_mem is no different from the memory handed out by the
page allocator. This allows that memory to be memory mapped to user space
in the same way the normal buffer is.
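
For illustration, the heart of the simplification is just the following
(a minimal sketch; the reservation name "trace" is only an example):

	phys_addr_t start, size;

	/* reserve_mem memory lives in the kernel's linear mapping */
	if (reserve_mem_find_by_name("trace", &start, &size)) {
		void *vaddr = phys_to_virt(start);
		/* vaddr is directly usable, no vmap() needed */
	}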

This new series does the following:

- Enforce that the memory mapping is page aligned (both the address and
  the size). If it is not, the instance creation errors out.

- Use phys_to_virt() to get the virtual address of the memory returned
  by reserve_mem. As that memory is freed via
  reserve_mem_release_by_name() and is no longer mapped by vmap(), the
  ring buffer freeing code doesn't need to do anything special for this
  mapping.

- Treat the buffer allocated via memmap differently. It still needs to be
  virtually mapped (it cannot use phys_to_virt()) and it must never be
  freed or memory mapped to user space. A new flag is added when a buffer
  is created this way to prevent it from ever being memory mapped to user
  space, and the reference count is incremented so that it can never be
  freed.

- Use vmap_page_range() instead of allocating an array of struct page
  pointers with kmalloc_array() and handing it to vmap().

- Use flush_kernel_vmap_range() instead of flush_dcache_folio().

Changes since v4: https://lore.kernel.org/linux-trace-kernel/20250401215115.602501043@goodmis.org/

- Fixed calling free_reserve_area() twice on the same memory


Steven Rostedt (4):
      tracing: Enforce the persistent ring buffer to be page aligned
      tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
      tracing: Use vmap_page_range() to map memmap ring buffer
      ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()

----
 Documentation/admin-guide/kernel-parameters.txt |  2 +
 Documentation/trace/debugging.rst               |  2 +
 kernel/trace/ring_buffer.c                      |  5 +-
 kernel/trace/trace.c                            | 68 ++++++++++++++++---------
 kernel/trace/trace.h                            |  1 +
 5 files changed, 52 insertions(+), 26 deletions(-)


* [PATCH v5 1/4] tracing: Enforce the persistent ring buffer to be page aligned
From: Steven Rostedt @ 2025-04-01 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Linus Torvalds, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Vincent Donnefort, Vlastimil Babka, Mike Rapoport,
	Jann Horn

From: Steven Rostedt <rostedt@goodmis.org>

Enforce that the address and the size of the memory used by the persistent
ring buffer are page aligned. Also update the documentation to reflect this
requirement.
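
For example, a page-aligned reservation can be set up on the kernel
command line along these lines (the size, alignment, and name here are
only illustrative):

	reserve_mem=12M:4096:trace trace_instance=boot_map@trace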

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 Documentation/trace/debugging.rst               |  2 ++
 kernel/trace/trace.c                            | 12 ++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3435a062a208..f904fd8481bd 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7266,6 +7266,8 @@
 			This is just one of many ways that can clear memory. Make sure your system
 			keeps the content of memory across reboots before relying on this option.
 
+			NB: Both the mapped address and size must be page aligned for the architecture.
+
 			See also Documentation/trace/debugging.rst
 
 
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index 54fb16239d70..d54bc500af80 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -136,6 +136,8 @@ kernel, so only the same kernel is guaranteed to work if the mapping is
 preserved. Switching to a different kernel version may find a different
 layout and mark the buffer as invalid.
 
+NB: Both the mapped address and size must be page aligned for the architecture.
+
 Using trace_printk() in the boot instance
 -----------------------------------------
 By default, the content of trace_printk() goes into the top level tracing
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index de6d7f0e6206..de9c237e5826 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10788,6 +10788,18 @@ __init static void enable_instances(void)
 		}
 
 		if (start) {
+			/* Start and size must be page aligned */
+			if (start & ~PAGE_MASK) {
+				pr_warn("Tracing: mapping start addr %lx is not page aligned\n",
+					(unsigned long)start);
+				continue;
+			}
+			if (size & ~PAGE_MASK) {
+				pr_warn("Tracing: mapping size %lx is not page aligned\n",
+					(unsigned long)size);
+				continue;
+			}
+
 			addr = map_pages(start, size);
 			if (addr) {
 				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
-- 
2.47.2




* [PATCH v5 2/4] tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
From: Steven Rostedt @ 2025-04-01 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Linus Torvalds, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Vincent Donnefort, Vlastimil Babka, Mike Rapoport,
	Jann Horn

From: Steven Rostedt <rostedt@goodmis.org>

The reserve_mem kernel command line option may pass back a physical
address, but the memory is still part of the normal memory just like
using memblock_reserve() would be. This means that the physical memory
returned by the reserve_mem command line option can be converted directly
to virtual memory by simply using phys_to_virt().

When freeing the buffer there's no need to call vunmap() anymore as the
memory allocated by reserve_mem is freed by the call to
reserve_mem_release_by_name().

Because the persistent ring buffer can also be allocated via the memmap
option, which *is* different than normal memory as it cannot be added back
to the buddy system, it must be treated differently. It still needs to be
virtually mapped to have access to it. It also can not be freed nor can it
ever be memory mapped to user space.

Create a new trace_array flag called TRACE_ARRAY_FL_MEMMAP which gets set
if the buffer is created by the memmap option, and this will prevent the
buffer from being memory mapped by user space.

Also increment the ref count for memmap'ed buffers so that they can never
be freed.
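
For reference, the memmap case that the new flag covers is a boot
instance created with a command line along these lines (the address and
size are only an example):

	memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M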

Link: https://lore.kernel.org/all/Z-wFszhJ_9o4dc8O@kernel.org/

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 23 ++++++++++++++++-------
 kernel/trace/trace.h |  1 +
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index de9c237e5826..2f9c91f26d5b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8505,6 +8505,10 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
 	struct trace_iterator *iter = &info->iter;
 	int ret = 0;
 
+	/* A memmap'ed buffer is not supported for user space mmap */
+	if (iter->tr->flags & TRACE_ARRAY_FL_MEMMAP)
+		return -ENODEV;
+
 	/* Currently the boot mapped buffer is not supported for mmap */
 	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
 		return -ENODEV;
@@ -9614,9 +9618,6 @@ static void free_trace_buffers(struct trace_array *tr)
 #ifdef CONFIG_TRACER_MAX_TRACE
 	free_trace_buffer(&tr->max_buffer);
 #endif
-
-	if (tr->range_addr_start)
-		vunmap((void *)tr->range_addr_start);
 }
 
 static void init_trace_flags_index(struct trace_array *tr)
@@ -10710,6 +10711,7 @@ static inline void do_allocate_snapshot(const char *name) { }
 __init static void enable_instances(void)
 {
 	struct trace_array *tr;
+	bool memmap_area = false;
 	char *curr_str;
 	char *name;
 	char *str;
@@ -10778,6 +10780,7 @@ __init static void enable_instances(void)
 					name);
 				continue;
 			}
+			memmap_area = true;
 		} else if (tok) {
 			if (!reserve_mem_find_by_name(tok, &start, &size)) {
 				start = 0;
@@ -10800,7 +10803,10 @@ __init static void enable_instances(void)
 				continue;
 			}
 
-			addr = map_pages(start, size);
+			if (memmap_area)
+				addr = map_pages(start, size);
+			else
+				addr = (unsigned long)phys_to_virt(start);
 			if (addr) {
 				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
 					name, &start, (unsigned long)size);
@@ -10827,10 +10833,13 @@ __init static void enable_instances(void)
 			update_printk_trace(tr);
 
 		/*
-		 * If start is set, then this is a mapped buffer, and
-		 * cannot be deleted by user space, so keep the reference
-		 * to it.
+		 * memmap'd buffers can not be freed.
 		 */
+		if (memmap_area) {
+			tr->flags |= TRACE_ARRAY_FL_MEMMAP;
+			tr->ref++;
+		}
+
 		if (start) {
 			tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
 			tr->range_name = no_free_ptr(rname);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c20f6bcc200a..f9513dc14c37 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -447,6 +447,7 @@ enum {
 	TRACE_ARRAY_FL_BOOT		= BIT(1),
 	TRACE_ARRAY_FL_LAST_BOOT	= BIT(2),
 	TRACE_ARRAY_FL_MOD_INIT		= BIT(3),
+	TRACE_ARRAY_FL_MEMMAP		= BIT(4),
 };
 
 #ifdef CONFIG_MODULES
-- 
2.47.2




* [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Steven Rostedt @ 2025-04-01 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Linus Torvalds, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Vincent Donnefort, Vlastimil Babka, Mike Rapoport,
	Jann Horn

From: Steven Rostedt <rostedt@goodmis.org>

The code to map the physical memory retrieved by memmap currently
allocates an array of pages to cover the physical memory and then calls
vmap() to map it to a virtual address. Instead of using this temporary
array of struct page descriptors, simply use vmap_page_range() that can
directly map the contiguous physical memory to a virtual address.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2f9c91f26d5b..e30d1e058fea 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -50,6 +50,7 @@
 #include <linux/irq_work.h>
 #include <linux/workqueue.h>
 #include <linux/sort.h>
+#include <linux/io.h> /* vmap_page_range() */
 
 #include <asm/setup.h> /* COMMAND_LINE_SIZE */
 
@@ -9810,29 +9811,27 @@ static int instance_mkdir(const char *name)
 	return ret;
 }
 
-static u64 map_pages(u64 start, u64 size)
+static u64 map_pages(unsigned long start, unsigned long size)
 {
-	struct page **pages;
-	phys_addr_t page_start;
-	unsigned int page_count;
-	unsigned int i;
-	void *vaddr;
-
-	page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+	unsigned long vmap_start, vmap_end;
+	struct vm_struct *area;
+	int ret;
 
-	page_start = start;
-	pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
+	area = get_vm_area(size, VM_IOREMAP);
+	if (!area)
 		return 0;
 
-	for (i = 0; i < page_count; i++) {
-		phys_addr_t addr = page_start + i * PAGE_SIZE;
-		pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+	vmap_start = (unsigned long) area->addr;
+	vmap_end = vmap_start + size;
+
+	ret = vmap_page_range(vmap_start, vmap_end,
+			      start, pgprot_nx(PAGE_KERNEL));
+	if (ret < 0) {
+		free_vm_area(area);
+		return 0;
 	}
-	vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
-	kfree(pages);
 
-	return (u64)(unsigned long)vaddr;
+	return (u64)vmap_start;
 }
 
 /**
-- 
2.47.2




* [PATCH v5 4/4] ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
From: Steven Rostedt @ 2025-04-01 22:58 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Linus Torvalds, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Vincent Donnefort, Vlastimil Babka, Mike Rapoport,
	Jann Horn, stable

From: Steven Rostedt <rostedt@goodmis.org>

Some architectures do not have data cache coherency between user and
kernel space. For these architectures, the cache needs to be flushed on
both the kernel and user addresses so that user space can see the updates
the kernel has made.

Instead of using flush_dcache_folio() and playing games with
virt_to_folio() within that call, use flush_kernel_vmap_range(), which
takes the virtual address and size and does the work for those
architectures that need it.

This also fixes a bug where the flush of the reader page only flushed one
page. If the sub-buffer order is 1 or more, so that the sub-buffer size is
greater than a page, it would miss the rest of the sub-buffer content, as
the "reader page" is not just a single page but the size of a sub-buffer.

Link: https://lore.kernel.org/all/CAG48ez3w0my4Rwttbc5tEbNsme6tc0mrSN95thjXUFaJ3aQ6SA@mail.gmail.com/

Cc: stable@vger.kernel.org
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d8d7b28e2c2f..c0f877d39a24 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6016,7 +6016,7 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
 	meta->read = cpu_buffer->read;
 
 	/* Some archs do not have data cache coherency between kernel and user-space */
-	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
+	flush_kernel_vmap_range(cpu_buffer->meta_page, PAGE_SIZE);
 }
 
 static void
@@ -7319,7 +7319,8 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 
 out:
 	/* Some archs do not have data cache coherency between kernel and user-space */
-	flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
+	flush_kernel_vmap_range(cpu_buffer->reader_page->page,
+				buffer->subbuf_size + BUF_PAGE_HDR_SIZE);
 
 	rb_update_meta_page(cpu_buffer);
 
-- 
2.47.2




* Re: [PATCH v5 1/4] tracing: Enforce the persistent ring buffer to be page aligned
From: Mike Rapoport @ 2025-04-02  9:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Linus Torvalds,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Vincent Donnefort, Vlastimil Babka, Jann Horn

On Tue, Apr 01, 2025 at 06:58:12PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Enforce that the address and the size of the memory used by the persistent
> ring buffer is page aligned. Also update the documentation to reflect this
> requirement.
> 
> Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  2 ++
>  Documentation/trace/debugging.rst               |  2 ++
>  kernel/trace/trace.c                            | 12 ++++++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3435a062a208..f904fd8481bd 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -7266,6 +7266,8 @@
>  			This is just one of many ways that can clear memory. Make sure your system
>  			keeps the content of memory across reboots before relying on this option.
>  
> +			NB: Both the mapped address and size must be page aligned for the architecture.
> +
>  			See also Documentation/trace/debugging.rst
>  
>  
> diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
> index 54fb16239d70..d54bc500af80 100644
> --- a/Documentation/trace/debugging.rst
> +++ b/Documentation/trace/debugging.rst
> @@ -136,6 +136,8 @@ kernel, so only the same kernel is guaranteed to work if the mapping is
>  preserved. Switching to a different kernel version may find a different
>  layout and mark the buffer as invalid.
>  
> +NB: Both the mapped address and size must be page aligned for the architecture.
> +
>  Using trace_printk() in the boot instance
>  -----------------------------------------
>  By default, the content of trace_printk() goes into the top level tracing
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index de6d7f0e6206..de9c237e5826 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -10788,6 +10788,18 @@ __init static void enable_instances(void)
>  		}
>  
>  		if (start) {
> +			/* Start and size must be page aligned */
> +			if (start & ~PAGE_MASK) {
> +				pr_warn("Tracing: mapping start addr %lx is not page aligned\n",
> +					(unsigned long)start);
> +				continue;
> +			}
> +			if (size & ~PAGE_MASK) {
> +				pr_warn("Tracing: mapping size %lx is not page aligned\n",
> +					(unsigned long)size);
> +				continue;
> +			}

Better to use %pa for printing a physical address, as on 32-bit systems
phys_addr_t may be unsigned long long:

	pr_warn("Tracing: mapping size %pa is not page aligned\n", &size);

> +
>  			addr = map_pages(start, size);
>  			if (addr) {
>  				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
> -- 
> 2.47.2
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v5 2/4] tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
From: Mike Rapoport @ 2025-04-02  9:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Linus Torvalds,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Vincent Donnefort, Vlastimil Babka, Jann Horn

On Tue, Apr 01, 2025 at 06:58:13PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> The reserve_mem kernel command line option may pass back a physical
> address, but the memory is still part of the normal memory just like
> using memblock_reserve() would be. This means that the physical memory

... using memblock_alloc() would be

> returned by the reserve_mem command line option can be converted directly
> to virtual memory by simply using phys_to_virt().
> 
> When freeing the buffer there's no need to call vunmap() anymore as the
> memory allocated by reserve_mem is freed by the call to
> reserve_mem_release_by_name().
> 
> Because the persistent ring buffer can also be allocated via the memmap
> option, which *is* different than normal memory as it cannot be added back
> to the buddy system, it must be treated differently. It still needs to be
> virtually mapped to have access to it. It also can not be freed nor can it
> ever be memory mapped to user space.
> 
> Create a new trace_array flag called TRACE_ARRAY_FL_MEMMAP which gets set
> if the buffer is created by the memmap option, and this will prevent the
> buffer from being memory mapped by user space.
> 
> Also increment the ref count for memmap'ed buffers so that they can never
> be freed.
> 
> Link: https://lore.kernel.org/all/Z-wFszhJ_9o4dc8O@kernel.org/
> 
> Suggested-by: Mike Rapoport <rppt@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace.c | 23 ++++++++++++++++-------
>  kernel/trace/trace.h |  1 +
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index de9c237e5826..2f9c91f26d5b 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -8505,6 +8505,10 @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
>  	struct trace_iterator *iter = &info->iter;
>  	int ret = 0;
>  
> +	/* A memmap'ed buffer is not supported for user space mmap */
> +	if (iter->tr->flags & TRACE_ARRAY_FL_MEMMAP)
> +		return -ENODEV;
> +
>  	/* Currently the boot mapped buffer is not supported for mmap */
>  	if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
>  		return -ENODEV;
> @@ -9614,9 +9618,6 @@ static void free_trace_buffers(struct trace_array *tr)
>  #ifdef CONFIG_TRACER_MAX_TRACE
>  	free_trace_buffer(&tr->max_buffer);
>  #endif
> -
> -	if (tr->range_addr_start)
> -		vunmap((void *)tr->range_addr_start);
>  }
>  
>  static void init_trace_flags_index(struct trace_array *tr)
> @@ -10710,6 +10711,7 @@ static inline void do_allocate_snapshot(const char *name) { }
>  __init static void enable_instances(void)
>  {
>  	struct trace_array *tr;
> +	bool memmap_area = false;
>  	char *curr_str;
>  	char *name;
>  	char *str;
> @@ -10778,6 +10780,7 @@ __init static void enable_instances(void)
>  					name);
>  				continue;
>  			}
> +			memmap_area = true;
>  		} else if (tok) {
>  			if (!reserve_mem_find_by_name(tok, &start, &size)) {
>  				start = 0;
> @@ -10800,7 +10803,10 @@ __init static void enable_instances(void)
>  				continue;
>  			}
>  
> -			addr = map_pages(start, size);
> +			if (memmap_area)
> +				addr = map_pages(start, size);
> +			else
> +				addr = (unsigned long)phys_to_virt(start);
>  			if (addr) {
>  				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
>  					name, &start, (unsigned long)size);
> @@ -10827,10 +10833,13 @@ __init static void enable_instances(void)
>  			update_printk_trace(tr);
>  
>  		/*
> -		 * If start is set, then this is a mapped buffer, and
> -		 * cannot be deleted by user space, so keep the reference
> -		 * to it.
> +		 * memmap'd buffers can not be freed.
>  		 */
> +		if (memmap_area) {
> +			tr->flags |= TRACE_ARRAY_FL_MEMMAP;
> +			tr->ref++;
> +		}
> +
>  		if (start) {
>  			tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
>  			tr->range_name = no_free_ptr(rname);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index c20f6bcc200a..f9513dc14c37 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -447,6 +447,7 @@ enum {
>  	TRACE_ARRAY_FL_BOOT		= BIT(1),
>  	TRACE_ARRAY_FL_LAST_BOOT	= BIT(2),
>  	TRACE_ARRAY_FL_MOD_INIT		= BIT(3),
> +	TRACE_ARRAY_FL_MEMMAP		= BIT(4),
>  };
>  
>  #ifdef CONFIG_MODULES
> -- 
> 2.47.2
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v5 1/4] tracing: Enforce the persistent ring buffer to be page aligned
From: Steven Rostedt @ 2025-04-02 14:26 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-trace-kernel, Linus Torvalds,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Vincent Donnefort, Vlastimil Babka, Jann Horn

On Wed, 2 Apr 2025 12:21:49 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> > +			if (size & ~PAGE_MASK) {
> > +				pr_warn("Tracing: mapping size %lx is not page aligned\n",
> > +					(unsigned long)size);
> > +				continue;
> > +			}  
> 
> Better use %pa for printing physical address as on 32-bit systems
> phys_addr_t may be unsigned long long:
> 
> 	pr_warn("Tracing: mapping size %pa is not page aligned\n", &size);

Thanks, will update.

-- Steve


* Re: [PATCH v5 2/4] tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
From: Steven Rostedt @ 2025-04-02 14:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-trace-kernel, Linus Torvalds,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Vincent Donnefort, Vlastimil Babka, Jann Horn

On Wed, 2 Apr 2025 12:24:12 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> > The reserve_mem kernel command line option may pass back a physical
> > address, but the memory is still part of the normal memory just like
> > using memblock_reserve() would be. This means that the physical memory  
> 
> ... using memblock_alloc() would be

Heh, I looked for examples of code that used free_reserved_area() and found
initramfs.c which uses memblock_reserve(), which is why I picked that.

I'll update.

Thanks for reviewing Mike!

-- Steve


* Re: [PATCH v5 1/4] tracing: Enforce the persistent ring buffer to be page aligned
From: Mathieu Desnoyers @ 2025-04-02 15:01 UTC (permalink / raw)
  To: Mike Rapoport, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Linus Torvalds,
	Masami Hiramatsu, Mark Rutland, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Jann Horn

On 2025-04-02 05:21, Mike Rapoport wrote:
> On Tue, Apr 01, 2025 at 06:58:12PM -0400, Steven Rostedt wrote:
>> From: Steven Rostedt <rostedt@goodmis.org>
>>
>> Enforce that the address and the size of the memory used by the persistent
>> ring buffer is page aligned. Also update the documentation to reflect this
>> requirement.

I've been loosely following this thread, and I'm confused about one
thing.

AFAIU the goal is to have the ftrace persistent ring buffer written to
through a memory range mapped by vmap_page_range(), and userspace maps
the buffer with its own virtual mappings.

With respect to architectures with aliasing dcache, is the plan:

A) To make sure all persistent ring buffer mappings are aligned on
    SHMLBA:

Quoting "Documentation/core-api/cachetlb.rst":

   Is your port susceptible to virtual aliasing in its D-cache?
   Well, if your D-cache is virtually indexed, is larger in size than
   PAGE_SIZE, and does not prevent multiple cache lines for the same
   physical address from existing at once, you have this problem.

   If your D-cache has this problem, first define asm/shmparam.h SHMLBA
   properly, it should essentially be the size of your virtually
   addressed D-cache (or if the size is variable, the largest possible
   size).  This setting will force the SYSv IPC layer to only allow user
   processes to mmap shared memory at address which are a multiple of
   this value.

or

B) to flush both the kernel and userspace mappings when a ring buffer
    page is handed over from writer to reader ?

I've seen both approaches being discussed in the recent threads, with
some participants recommending approach (A), but then the code
revisions that follow take approach (B).

AFAIU, if we are aiming for approach (A), then I'm missing where
vmap_page_range() guarantees that the _kernel_ virtual mapping is
SHMLBA aligned. AFAIU, only user mappings are aligned on SHMLBA.
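
(For illustration, the aliasing constraint boils down to congruence
modulo SHMLBA; a minimal sketch, with a hypothetical helper name:

	/* SHMLBA comes from asm/shmparam.h */
	static bool mappings_alias_cleanly(unsigned long kvaddr,
					   unsigned long uvaddr)
	{
		/*
		 * On a virtually indexed D-cache, two mappings of the
		 * same physical page hit the same cache lines only if
		 * their virtual addresses are congruent modulo SHMLBA.
		 */
		return ((kvaddr ^ uvaddr) & (SHMLBA - 1)) == 0;
	}

so approach (A) would require the same guarantee for the kernel-side
mapping.)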

And if we are aiming towards approach (A), then the explicit flushing
is not needed when handing over pages from writer to reader.

Please let me know if I'm missing something,

Thanks,

Mathieu

>>
>> Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/
>>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
>> ---
>>   Documentation/admin-guide/kernel-parameters.txt |  2 ++
>>   Documentation/trace/debugging.rst               |  2 ++
>>   kernel/trace/trace.c                            | 12 ++++++++++++
>>   3 files changed, 16 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 3435a062a208..f904fd8481bd 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -7266,6 +7266,8 @@
>>   			This is just one of many ways that can clear memory. Make sure your system
>>   			keeps the content of memory across reboots before relying on this option.
>>   
>> +			NB: Both the mapped address and size must be page aligned for the architecture.
>> +
>>   			See also Documentation/trace/debugging.rst
>>   
>>   
>> diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
>> index 54fb16239d70..d54bc500af80 100644
>> --- a/Documentation/trace/debugging.rst
>> +++ b/Documentation/trace/debugging.rst
>> @@ -136,6 +136,8 @@ kernel, so only the same kernel is guaranteed to work if the mapping is
>>   preserved. Switching to a different kernel version may find a different
>>   layout and mark the buffer as invalid.
>>   
>> +NB: Both the mapped address and size must be page aligned for the architecture.
>> +
>>   Using trace_printk() in the boot instance
>>   -----------------------------------------
>>   By default, the content of trace_printk() goes into the top level tracing
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index de6d7f0e6206..de9c237e5826 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -10788,6 +10788,18 @@ __init static void enable_instances(void)
>>   		}
>>   
>>   		if (start) {
>> +			/* Start and size must be page aligned */
>> +			if (start & ~PAGE_MASK) {
>> +				pr_warn("Tracing: mapping start addr %lx is not page aligned\n",
>> +					(unsigned long)start);
>> +				continue;
>> +			}
>> +			if (size & ~PAGE_MASK) {
>> +				pr_warn("Tracing: mapping size %lx is not page aligned\n",
>> +					(unsigned long)size);
>> +				continue;
>> +			}
> 
> Better use %pa for printing physical address as on 32-bit systems
> phys_addr_t may be unsigned long long:
> 
> 	pr_warn("Tracing: mapping size %pa is not page aligned\n", &size);
> 
>> +
>>   			addr = map_pages(start, size);
>>   			if (addr) {
>>   				pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
>> -- 
>> 2.47.2
>>
>>
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Linus Torvalds @ 2025-04-02 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Tue, 1 Apr 2025 at 15:57, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The code to map the physical memory retrieved by memmap currently
> allocates an array of pages to cover the physical memory and then calls
> vmap() to map it to a virtual address. Instead of using this temporary
> array of struct page descriptors, simply use vmap_page_range() that can
> directly map the contiguous physical memory to a virtual address.

Wait, what?

Didn't you just say that you can just use page_address() for the kernel mapping?

So the whole vmap thing is entirely unnecessary in the first place.

Including the simpler vmap_page_range().

             Linus


* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Steven Rostedt @ 2025-04-02 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 09:42:00 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Wait, what?
> 
> Didn't you just say that you can just use page_address() for the kernel mapping?

Correct. I misunderstood Mike when he created reserve_mem and had it
return physical addresses. I didn't realize the memory was already mapped
via the normal kernel virtual mappings.

> 
> So the whole vmap thing is entirely unnecessary in the first place.
> 
> Including the simpler vmap_page_range().

Not entirely. The original code only used memmap=, which does require
vmap_page_range(). reserve_mem came later, and thinking it returned just
physical address space (not a mapping), I had it use the memmap code.

That is, the persistent memory buffer was originally created on top of the
memmap feature (and only for x86). I talked with Mike about making it more
generic and he and I worked out the reserve_mem option. Having the memmap
code already doing the vmap() I just had the reserve_mem version use the
same, not knowing that it could just use phys_to_virt().

This patch series fixes that miscommunication and separates out a memmap'ed
buffer from reserve_mem buffer and simplifies everything.

-- Steve


* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Steven Rostedt @ 2025-04-02 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 12:55:48 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> This patch series fixes that miscommunication and separates out a memmap'ed
> buffer from reserve_mem buffer and simplifies everything.

Since I only care about memory mapping a buffer from reserve_mem to user
space, this series makes sure that a buffer created from memmap (or any
other physical memory area) is never used for mapping to user space. It
also prevents that buffer from being freed, as it is not part of the buddy
allocator and can't be added back to it later.

That also means all the tricks to determine where the page in the ring
buffer came from can go away.

-- Steve


* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
  2025-04-02 17:03       ` Steven Rostedt
@ 2025-04-02 17:14         ` Steven Rostedt
  2025-04-02 17:20           ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2025-04-02 17:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 13:03:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> That also means all the tricks to determine where the page in the ring
> buffer came from can go away.

I wonder, just to be safe, if I should add in the ring buffer mapping code:

	/* If the buffer is not part of the normal memory, do not allow mapping */
	if (!virt_addr_valid(cpu_buffer->reader_page->page))
		return -ENODEV;

?

As the buffer can come from anywhere, but we only allow a memory mapping to
user space if the buffer is from the normal memory allocator.

Currently, this patch series has the caller (the tracing code) prevent
allowing memmap to be user mapped. Perhaps the above should include a
WARN_ON_ONCE()?
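
That is, something along these lines (just a sketch):

	if (WARN_ON_ONCE(!virt_addr_valid(cpu_buffer->reader_page->page)))
		return -ENODEV;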

-- Steve


* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Linus Torvalds @ 2025-04-02 17:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 at 10:13, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>         /* If the buffer is not part of the normal memory, do not allow mapping */
>         if (!virt_addr_valid(cpu_buffer->reader_page->page))
>                 return -ENODEV;

Honestly, virt_addr_valid() is pretty much random.

It's not really a valid function outside of random debugging code. It
has no real semantics. The comment saying that it validates some kind
of 'virt_to_page()' thing is pure garbage.

> As the buffer can come from anywhere, but we only allow a memory mapping to
> user space if the buffer is from the normal memory allocator.

You should damn well keep track of where the memory comes from.

You can't just say "I'll take random shit, and then I'll ask the VM what it is".

Stop it.

If the address came from something you consider trustworthy, then just
trust it. If some admin person gave you garbage, it's better to just
get a random oops than to have random bogus code.

                 Linus


* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Steven Rostedt @ 2025-04-02 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 10:20:58 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> You should damn well keep track of where the memory comes from.

And it does.

> 
> You can't just say "I'll take random shit, and then I'll ask the VM what it is".
> 
> Stop it.
> 
> If the address came from something you consider trustworthy, then just
> trust it. If some admin person gave you garbage, it's better to just
> get a random oops than to have random bogus code.

This has nothing to do with admins. This would only occur if the kernel
itself created a buffer from some random physical address and then tried to
mmap it to user space (which would be a bug).

My early career came from the military industry (I worked on the C17 engine
control systems in the early '90s). It was drilled into my head to have
safety checks throughout the code so that, if something buggy happened, it
would be caught quickly later on.

The virt_addr_valid() would only be a debugging feature, hence the
added "WARN_ON()" for it.

But I'm fine to not do that.

-- Steve



* Re: [PATCH v5 3/4] tracing: Use vmap_page_range() to map memmap ring buffer
From: Linus Torvalds @ 2025-04-02 17:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Vincent Donnefort,
	Vlastimil Babka, Mike Rapoport, Jann Horn

On Wed, 2 Apr 2025 at 10:39, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This has nothing to do with admins. This would only occur if the kernel
> itself created a buffer from some random physical address and then tried to
> mmap it to user space (which would be a bug).

Do *not* try to check for bugs like that with virt_addr_valid().

It literally makes debugging harder.

You're much better off getting an oops, and then you have stack
traces, distro bug trackers, and various other automated tools that
give you information.

Trying to "validate" buggy data is crazy. It's absolutely the opposite
of safety. It's going to cause more bugs, it's going to only work for
the validation scenarios you thought about, and it's going to make it
harder to debug the cases it actually catches.

And if you are trying to catch kernel bugs, *any* data could be that
buggy data. So the whole concept is insane.

Yes, you could make every single line be a "WARN_ON()" with some
random check for the particular data you are using.

Or you could just write good solid code that is actually readable and
maintainable, and doesn't have random pointless checks in it.

          Linus

