public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping
@ 2023-12-29 18:40 Steven Rostedt
  2023-12-29 18:40 ` [RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions Steven Rostedt
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-12-29 18:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Vincent Donnefort


I'm sending this to a wider audience, as I want to hear more
feedback on this before I accept it.

Vincent has been working on allowing the ftrace ring buffer to be
memory mapped into user space. This has been going on since
last year, when we talked at the 2022 Tracing Summit in London.

Vincent's last series can be found here:

   https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-1-vdonnefort@google.com/

But I'm posting these patches as they are what I now have in my queue.

I've tested these patches pretty thoroughly and they look good.
I even have a libtracefs API patch ready to implement this.

For testing, you can install:

 git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git (version 1.8.1)

 git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
  (latest branch of libtracefs)

Then apply:

  https://lore.kernel.org/all/20231228201100.78aae259@rorschach.local.home/

to libtracefs. And then build the samples:

  make samples

Which will create: bin/cpu-map

That you can use with:

  # trace-cmd start -e sched
  # bin/cpu-map 0

Which will output all the events in CPU 0 via the memory mapping if it is
supported by the kernel; otherwise it will print at the start:

  "Was not able to map, falling back to buffered read"

If you want to see the source code for cpu-map, you have to look
at the man page ;-) The "make samples" will extract the example code
from the man pages and build them.

  Documentation/libtracefs-cpu-map.txt

The library API is rather simple; it just has:

 tracefs_cpu_open_mapped()
 tracefs_cpu_is_mapped()
 tracefs_cpu_map()
 tracefs_cpu_unmap()

Which will create a tracefs_cpu handle for reading. If it fails to
map, it just does a normal read.

Anyway, this email is not about the library interface, but more of
the kernel interface. And that is this:

First there's a "meta page" that can be mapped via mmap() on the
trace_pipe_raw file (there's one trace_pipe_raw file per CPU):

        page_size = getpagesize();
        meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);

The meta will have a layout of:

struct trace_buffer_meta {
        unsigned long   entries;
        unsigned long   overrun;
        unsigned long   read;

        unsigned long   subbufs_touched;
        unsigned long   subbufs_lost;
        unsigned long   subbufs_read;

        struct {
                unsigned long   lost_events;    /* Events lost at the time of the reader swap */
                __u32           id;             /* Reader subbuf ID from 0 to nr_subbufs - 1 */
                __u32           read;           /* Number of bytes read on the reader subbuf */
        } reader;

        __u32           subbuf_size;            /* Size of each subbuf including the header */
        __u32           nr_subbufs;             /* Number of subbufs in the ring-buffer */

        __u32           meta_page_size;         /* Size of the meta-page */
        __u32           meta_struct_len;        /* Len of this struct */
};
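As a rough sketch of how user space might sanity-check these fields
before using the mapping (the helper names and checks below are my own
illustration, not part of the proposed interface):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch: sanity checks a reader could apply to the meta page before
 * trusting it.  The field semantics come from the struct above; the
 * helper names and exact checks are illustrative only.
 */

/* The kernel may append fields later (meta_struct_len grows), so a
 * reader compiled against an older, shorter definition is still fine
 * as long as the kernel's struct is at least that long. */
static int rb_meta_len_ok(uint32_t meta_struct_len, size_t compiled_len)
{
	return meta_struct_len >= compiled_len;
}

/* Length of the data mapping that follows the meta page: one mapping
 * covering every sub-buffer, including the reader sub-buffer. */
static size_t rb_data_len(uint32_t subbuf_size, uint32_t nr_subbufs)
{
	return (size_t)subbuf_size * nr_subbufs;
}
```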

The meta_page_size can grow if we need it to (and then we can extend this
API as needed). If the meta_page_size is greater than a page, user space
should remap it:

        if (meta->meta_page_size > page_size) {
                int new_size = meta->meta_page_size;
                munmap(meta, page_size);
                meta = mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, 0);
        }

Now the sub-buffers of the ring buffer are mapped right after
the meta page. They can be mapped with:

        data_len = meta->subbuf_size * meta->nr_subbufs;
        data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
                    fd, meta->meta_page_size);

This maps all the ring buffer sub-buffers as well as the reader page.
The way this works is that the reader page is free for user space to
read from, and the writer will only write to the other pages.

To get the reader page:

	subbuf = data + meta->subbuf_size * meta->reader.id;

Then you can load that into the libtraceevent kbuffer API:

	struct tep_handle *tep = tracefs_local_events(NULL);
	kbuf = tep_kbuffer(tep);
	kbuffer_load_subbuffer(kbuf, subbuf);

And use the kbuf descriptor to iterate the events.

When done with the reader page, the application needs to make an
ioctl() call:

	ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);


This will swap the reader page with the head of the other pages:
the old reader page goes into the writable portion of the ring buffer
where the writer can write to it, and the page that was swapped out
becomes the new reader page. If there is no data, user space can check
meta->reader.id to see if it changed or not. If it did not change,
then there's no new data.

If the writer is still on that page, it acts the same as the kernel's
own reader does. The writer can still update that page, and the
beginning of the sub-buffer holds the index of where the writer
currently is on that page.
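The "did meta->reader.id change" check above is simple enough to
sketch; the helper names and signatures below are my own illustration,
and only the meta-page field semantics come from this letter:

```c
#include <stddef.h>
#include <stdint.h>

/* Compare reader.id sampled before and after the ioctl: if it is
 * unchanged, no sub-buffer was swapped in, so there is no new data. */
static int rb_new_data(uint32_t id_before, uint32_t id_after)
{
	return id_after != id_before;
}

/* Offset of the current reader sub-buffer within the data mapping,
 * i.e. the "subbuf = data + meta->subbuf_size * meta->reader.id"
 * computation from earlier in this letter. */
static size_t rb_reader_off(uint32_t subbuf_size, uint32_t id)
{
	return (size_t)subbuf_size * id;
}
```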

All the mapped pages are read-only.

When the ring buffer is mapped, it cannot be resized, nor can its
sub-buffers be resized. It is basically locked. Any splice system call
on it will fall back to copying instead of moving pages.

I think it will be possible to extend this to avoid the ioctl and have
the reader just compare the content before and after, but I'm not so
worried about that right now.

As it is close to the merge window, I do not plan on pushing it into
6.8. But I do want it in 6.9. I believe it's currently ready as is, but I'm
willing to change it if someone comes up with a good argument to do
so.

Thanks!


Vincent Donnefort (2):
      ring-buffer: Introducing ring-buffer mapping functions
      tracing: Allow user-space mapping of the ring-buffer

----
 include/linux/ring_buffer.h     |   7 +
 include/uapi/linux/trace_mmap.h |  31 ++++
 kernel/trace/ring_buffer.c      | 382 +++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.c            |  79 ++++++++-
 4 files changed, 495 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/trace_mmap.h

* [RFC PATCH 0/2] Introducing trace buffer mapping by user-space
@ 2023-02-12 15:32 Vincent Donnefort
  2023-02-12 15:32 ` [RFC PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
  0 siblings, 1 reply; 5+ messages in thread
From: Vincent Donnefort @ 2023-02-12 15:32 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: kernel-team, Vincent Donnefort

Hi all,

We (Android folks) have recently been working on bringing tracing to the
pKVM hypervisor (more about pKVM? [1] [2]), reusing as much as possible the
tracefs support already available in the host. More specifically, sharing
the ring_buffer_per_cpu between the kernel and the hypervisor, the latter
being the writer while the former is only reading. After presenting this
endeavour at the Tracing Summit at the end of last year [3], Steven
observed this is a similar problem to another idea he had a while ago:
mapping the tracing ring buffers directly into userspace.

The tracing ring-buffer can be stored or sent to the network without any
copy via splice. However, the latter doesn't allow real-time processing
of the traces by userspace without a copy, which can only be achieved by
letting userspace map the ring-buffer directly.

And indeed, in both ideas we have a ring-buffer, one entity being the
writer, the other being a reader, and both sharing the ring-buffer pages
while having different VA spaces. So here's an RFC bringing userspace
mapping of a ring-buffer; even if it doesn't cover the pKVM hypervisor
case, it nonetheless brings building blocks that will be reused later.

Any feedback very much appreciated.

Vincent

[1] https://lwn.net/Articles/836693/
[2] https://www.youtube.com/watch?v=9npebeVFbFw
[3] https://tracingsummit.org/ts/2022/hypervisortracing/

-- 

As an example, Steve wrote this quick demo that only needs libtracefs:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdarg.h>
  #include <errno.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <tracefs.h>
  #include <kbuffer.h>
  #include <event-parse.h>
  
  #include <asm/types.h>
  #include <sys/mman.h>
  #include <sys/ioctl.h>
  
  #define TRACE_MMAP_IOCTL_GET_READER_PAGE	_IO('T', 0x1)
  #define TRACE_MMAP_IOCTL_UPDATE_META_PAGE	_IO('T', 0x2)
  
  struct ring_buffer_meta_page {
  	__u64		entries;
  	__u64		overrun;
  	__u32		pages_touched;
  	__u32		reader_page;
  	__u32		nr_data_pages;
  	__u32		data_page_head;
  	__u32		data_pages[];
  };
  
  static char *argv0;
  static int page_size;
  
  static char *get_this_name(void)
  {
  	static char *this_name;
  	char *arg;
  	char *p;
  
  	if (this_name)
  		return this_name;
  
  	arg = argv0;
  	p = arg+strlen(arg);
  
  	while (p >= arg && *p != '/')
  		p--;
  	p++;
  
  	this_name = p;
  	return p;
  }
  
  static void usage(void)
  {
  	char *p = get_this_name();
  
  	printf("usage: %s exec\n"
  	       "\n",p);
  	exit(-1);
  }
  
  static void __vdie(const char *fmt, va_list ap, int err)
  {
  	int ret = errno;
  	char *p = get_this_name();
  
  	if (err && errno)
  		perror(p);
  	else
  		ret = -1;
  
  	fprintf(stderr, "  ");
  	vfprintf(stderr, fmt, ap);
  
  	fprintf(stderr, "\n");
  	exit(ret);
  }
  
  void die(const char *fmt, ...)
  {
  	va_list ap;
  
  	va_start(ap, fmt);
  	__vdie(fmt, ap, 0);
  	va_end(ap);
  }
  
  void pdie(const char *fmt, ...)
  {
  	va_list ap;
  
  	va_start(ap, fmt);
  	__vdie(fmt, ap, 1);
  	va_end(ap);
  }
  
  static void read_page(struct tep_handle *tep, struct kbuffer *kbuf,
  		      void *data, int page)
  {
  	static struct trace_seq seq;
  	struct tep_record record;
  
  	if (seq.buffer)
  		trace_seq_reset(&seq);
  	else
  		trace_seq_init(&seq);
  
  	kbuffer_load_subbuffer(kbuf, data + page_size * page);
  	while ((record.data = kbuffer_read_event(kbuf, &record.ts))) {
  		kbuffer_next_event(kbuf, NULL);
  		tep_print_event(tep, &seq, &record,
  				"%s-%d %9d\t%s: %s\n",
  				TEP_PRINT_COMM,
  				TEP_PRINT_PID,
  				TEP_PRINT_TIME,
  				TEP_PRINT_NAME,
  				TEP_PRINT_INFO);
  		trace_seq_do_printf(&seq);
  		trace_seq_reset(&seq);
  	}
  }
  
  static int get_reader_page(int fd, struct ring_buffer_meta_page *meta)
  {
  	return meta->reader_page;
  }
  
  static int next_reader_page(int fd, struct ring_buffer_meta_page *meta)
  {
  	if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER_PAGE) < 0)
  		pdie("ioctl");
  	return meta->reader_page;
  }
  
  int main (int argc, char **argv)
  {
  	struct ring_buffer_meta_page *map;
  	struct tep_handle *tep;
  	struct kbuffer *kbuf;
  	unsigned long *p;
  	void *meta;
  	void *data;
  	char *buf;
  	int data_len;
  	int start;
  	int page;
  	int fd;
  
  	argv0 = argv[0];
  
  	tep = tracefs_local_events(NULL);
  	kbuf = tep_kbuffer(tep);
  
  	page_size = getpagesize();
  
  	fd = tracefs_instance_file_open(NULL, "per_cpu/cpu0/trace_pipe_raw",
  					O_RDONLY);
  	if (fd < 0)
  		pdie("raw");
  
  	meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
  	if (meta == MAP_FAILED)
  		pdie("mmap");
  
  	if (ioctl(fd, TRACE_MMAP_IOCTL_UPDATE_META_PAGE) < 0)
  		pdie("ioctl");
  
  	map = meta;
  	printf("entries:	%llu\n", map->entries);
  	printf("overrun:	%llu\n", map->overrun);
  	printf("pages_touched:	%u\n", map->pages_touched);
  	printf("reader_page:	%u\n", map->reader_page);
  	printf("nr_data_pages:	%u\n\n", map->nr_data_pages);
  
  	data_len = page_size * map->nr_data_pages;
  
  	data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, page_size);
  	if (data == MAP_FAILED)
  		pdie("mmap data");
  
  	page = get_reader_page(fd, meta);
  	start = page;
  	do {
  		read_page(tep, kbuf, data, page);
  		printf("reader_page:	%u\n", map->reader_page);
  		printf("PAGE: %d\n", page);
  	} while ((page = next_reader_page(fd, meta)) != start);
  	
  	p = data;
  	printf("%lx\n%lx\n%lx\n\n", p[0], p[1], p[2]);
  
  	munmap(data, data_len);
  	munmap(meta, page_size);
  	close(fd);
  
  	buf = tracefs_instance_file_read(NULL, "per_cpu/cpu0/stats", NULL);
  	if (!buf)
  		pdie("stats");
  	printf("%s\n", buf);
  	free(buf);
  
  
  	return 0;
  }

Vincent Donnefort (2):
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer

 include/linux/ring_buffer.h     |   8 +
 include/uapi/linux/trace_mmap.h |  17 ++
 kernel/trace/ring_buffer.c      | 355 +++++++++++++++++++++++++++++++-
 kernel/trace/trace.c            |  74 ++++++-
 4 files changed, 441 insertions(+), 13 deletions(-)
 create mode 100644 include/uapi/linux/trace_mmap.h

-- 
2.39.1.581.gbfd45094c4-goog



end of thread, other threads:[~2023-12-29 18:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-29 18:40 [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping Steven Rostedt
2023-12-29 18:40 ` [RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions Steven Rostedt
2023-12-29 18:40 ` [RFC][PATCH 2/2] tracing: Allow user-space mapping of the ring-buffer Steven Rostedt
2023-12-29 18:52 ` [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping Steven Rostedt
  -- strict thread matches above, loose matches on Subject: below --
2023-02-12 15:32 [RFC PATCH 0/2] Introducing trace buffer mapping by user-space Vincent Donnefort
2023-02-12 15:32 ` [RFC PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
