* [RFC PATCH 00/11] Tracefs support for pKVM
@ 2024-08-05 17:32 Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 01/11] ring-buffer: Check for empty ring-buffer with rb_num_of_entries() Vincent Donnefort
` (11 more replies)
0 siblings, 12 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
The growing set of features supported by the hypervisor in protected
mode necessitates debugging and profiling tools. Tracefs is the
ideal candidate for this task:
* It is simple to use and to script.
* It is supported by various tools, from the trace-cmd CLI to Android's
web-based Perfetto.
* The ring-buffer, where trace events are stored, consists of linked
pages, making it an ideal structure for sharing between the kernel and
the hypervisor.
This series introduces a method to create events and to generate them
from the hypervisor (hyp_enter/hyp_exit are given as examples), as well
as a Tracefs user-space interface to read them.
A presentation was given on this matter during the tracing summit in
2022. [1]
1. ring-buffer
--------------
To set up the per-CPU ring-buffers, a new interface is created:
ring_buffer_writer: Describes what the kernel needs to know about the
writer, that is, the set of pages forming the
ring-buffer and a callback for the reader/head
swapping (which enables consuming reads).
ring_buffer_reader(): Creates a read-only ring-buffer from a
ring_buffer_writer.
To keep the internals of `struct trace_buffer` in sync with the writer,
the meta-page is used. It was originally introduced to enable user-space
mapping of the ring-buffer [2]. In this case, the kernel is no longer the
producer but the reader. The function to read that meta-page is:
ring_buffer_poll_writer():
Updates `struct trace_buffer` based on the writer's
meta-page and wakes up readers if necessary.
The kernel has to poll the meta-page to be notified of newly written
events.
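For illustration, here is a rough and hypothetical sketch of the kernel-side
setup (my_get_reader_page() and the way the page description is obtained are
placeholders; only the ring_buffer_writer / ring_buffer_reader() /
ring_buffer_poll_writer() interface comes from this series):

  /* Called when the kernel has exhausted the current reader page */
  static int my_get_reader_page(int cpu)
  {
          /* e.g. ask the writer, via a hypercall, to swap the reader page */
          return 0;
  }

  static struct ring_buffer_writer writer = {
          .get_reader_page = my_get_reader_page,
  };

  struct trace_buffer *buffer;

  writer.pdesc = trace_pdesc; /* page description filled by the writer side */
  buffer = ring_buffer_reader(&writer);

  /* Periodically sync with the writer's meta-page and wake up readers */
  ring_buffer_poll_writer(buffer, RING_BUFFER_ALL_CPUS);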
2. Tracefs interface
--------------------
The interface is a hyp/ folder at the root of the tracefs mount point.
This folder behaves like an instance and exposes a subset of the regular
Tracefs user-space interface:
hyp/
  buffer_size_kb
  trace_pipe
  trace_pipe_raw
  trace
  per_cpu/
    cpuX/
      trace_pipe
      trace_pipe_raw
  events/
    hyp/
      hyp_enter/
        enable
        id
Behind the scenes, kvm/hyp_trace.c must rebuild the tracing hierarchy
without relying on kernel/trace/trace.c. This is due to fundamental
differences:
* Hypervisor tracing doesn't support trace_array's system-specific
features (snapshots, tracers, etc.).
* Logged event formats differ (e.g., no PID in hypervisor
events).
* Buffer operations require specific hypervisor interactions.
3. Events
---------
In the hypervisor, "hyp events" can be generated with trace_<event_name>(),
in a similar fashion to what the kernel does. They are also declared with
macros similar to the kernel's (see kvm_hypevents.h):
HYP_EVENT("foboar",
HE_PROTO(void),
HE_STRUCT(),
HE_ASSIGN(),
HE_PRINTK(" ")
)
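The hypervisor can then emit that event with the generated
trace_<event_name>() helper, e.g.:

  trace_foobar();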
Despite the apparent similarities with TRACE_EVENT(), those macros differ
internally: they must be used in parallel between the hypervisor (for the
writing part) and the kernel (for the reading part), which makes it
difficult to share anything with their kernel counterparts.
Also, the events/ directory doesn't use eventfs.
4. A few limitations
--------------------
Non-consuming reads of the buffer (i.e. cat trace) aren't supported, due
to the lack of support in the ring-buffer meta-page.
Tracing must be stopped for the buffer to be reset, i.e.:
echo 0 > tracing_on; echo 0 > trace
[1] https://tracingsummit.org/ts/2022/hypervisortracing/
[2] https://lore.kernel.org/all/20240510140435.3550353-1-vdonnefort@google.com/
Vincent Donnefort (11):
ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
ring-buffer: Introducing ring-buffer writer
ring-buffer: Expose buffer_data_page material
timekeeping: Export the boot clock in snapshots
KVM: arm64: Support unaligned fixmap in the nVHE hyp
KVM: arm64: Add clock support in the nVHE hyp
KVM: arm64: Add tracing support for the pKVM hyp
KVM: arm64: Add hyp tracing to tracefs
KVM: arm64: Add raw interface for hyp tracefs
KVM: arm64: Add support for hyp events
KVM: arm64: Add kselftest for tracefs hyp tracefs
arch/arm64/include/asm/kvm_asm.h | 6 +
arch/arm64/include/asm/kvm_define_hypevents.h | 60 ++
arch/arm64/include/asm/kvm_hyp.h | 6 +
arch/arm64/include/asm/kvm_hypevents.h | 41 +
arch/arm64/include/asm/kvm_hypevents_defs.h | 41 +
arch/arm64/include/asm/kvm_hyptrace.h | 38 +
arch/arm64/kernel/image-vars.h | 4 +
arch/arm64/kernel/vmlinux.lds.S | 18 +
arch/arm64/kvm/Kconfig | 9 +
arch/arm64/kvm/Makefile | 2 +
arch/arm64/kvm/arm.c | 6 +
arch/arm64/kvm/hyp/hyp-constants.c | 4 +
arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h | 13 +
arch/arm64/kvm/hyp/include/nvhe/clock.h | 15 +
.../kvm/hyp/include/nvhe/define_events.h | 21 +
arch/arm64/kvm/hyp/include/nvhe/trace.h | 55 ++
arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
arch/arm64/kvm/hyp/nvhe/clock.c | 42 +
arch/arm64/kvm/hyp/nvhe/events.c | 35 +
arch/arm64/kvm/hyp/nvhe/ffa.c | 2 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 64 ++
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +
arch/arm64/kvm/hyp/nvhe/mm.c | 2 +-
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 14 +-
arch/arm64/kvm/hyp/nvhe/switch.c | 5 +-
arch/arm64/kvm/hyp/nvhe/trace.c | 594 ++++++++++++
arch/arm64/kvm/hyp_events.c | 164 ++++
arch/arm64/kvm/hyp_trace.c | 854 ++++++++++++++++++
arch/arm64/kvm/hyp_trace.h | 15 +
include/linux/ring_buffer.h | 124 ++-
include/linux/timekeeping.h | 6 +
kernel/time/timekeeping.c | 9 +
kernel/trace/ring_buffer.c | 244 +++--
tools/testing/selftests/hyp-trace/Makefile | 6 +
tools/testing/selftests/hyp-trace/config | 4 +
.../selftests/hyp-trace/hyp-trace-test | 161 ++++
36 files changed, 2591 insertions(+), 98 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_define_hypevents.h
create mode 100644 arch/arm64/include/asm/kvm_hypevents.h
create mode 100644 arch/arm64/include/asm/kvm_hypevents_defs.h
create mode 100644 arch/arm64/include/asm/kvm_hyptrace.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/clock.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/define_events.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/trace.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/clock.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/events.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/trace.c
create mode 100644 arch/arm64/kvm/hyp_events.c
create mode 100644 arch/arm64/kvm/hyp_trace.c
create mode 100644 arch/arm64/kvm/hyp_trace.h
create mode 100644 tools/testing/selftests/hyp-trace/Makefile
create mode 100644 tools/testing/selftests/hyp-trace/config
create mode 100644 tools/testing/selftests/hyp-trace/hyp-trace-test
base-commit: e4fc196f5ba36eb7b9758cf2c73df49a44199895
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 01/11] ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer Vincent Donnefort
` (10 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
Currently there are two ways of identifying an empty ring-buffer. One
relying on the current status of the commit / reader page
(rb_per_cpu_empty()) and the other on the write and read counters
(rb_num_of_entries() used in rb_get_reader_page()).
Make rb_per_cpu_empty() rely on rb_num_of_entries() as well. This intends
to ease the later introduction of ring-buffer writers which are out of the
kernel's control and for which the only information available is through
the meta-page counters.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28853966aa9a..f4f4dda28077 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3953,40 +3953,22 @@ int ring_buffer_write(struct trace_buffer *buffer,
}
EXPORT_SYMBOL_GPL(ring_buffer_write);
-static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+/*
+ * The total entries in the ring buffer is the running counter
+ * of entries entered into the ring buffer, minus the sum of
+ * the entries read from the ring buffer and the number of
+ * entries that were overwritten.
+ */
+static inline unsigned long
+rb_num_of_entries(struct ring_buffer_per_cpu *cpu_buffer)
{
- struct buffer_page *reader = cpu_buffer->reader_page;
- struct buffer_page *head = rb_set_head_page(cpu_buffer);
- struct buffer_page *commit = cpu_buffer->commit_page;
-
- /* In case of error, head will be NULL */
- if (unlikely(!head))
- return true;
-
- /* Reader should exhaust content in reader page */
- if (reader->read != rb_page_size(reader))
- return false;
-
- /*
- * If writers are committing on the reader page, knowing all
- * committed content has been read, the ring buffer is empty.
- */
- if (commit == reader)
- return true;
-
- /*
- * If writers are committing on a page other than reader page
- * and head page, there should always be content to read.
- */
- if (commit != head)
- return false;
+ return local_read(&cpu_buffer->entries) -
+ (local_read(&cpu_buffer->overrun) + cpu_buffer->read);
+}
- /*
- * Writers are committing on the head page, we just need
- * to care about there're committed data, and the reader will
- * swap reader page with head page when it is to read data.
- */
- return rb_page_commit(commit) == 0;
+static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ return !rb_num_of_entries(cpu_buffer);
}
/**
@@ -4132,19 +4114,6 @@ void ring_buffer_record_enable_cpu(struct trace_buffer *buffer, int cpu)
}
EXPORT_SYMBOL_GPL(ring_buffer_record_enable_cpu);
-/*
- * The total entries in the ring buffer is the running counter
- * of entries entered into the ring buffer, minus the sum of
- * the entries read from the ring buffer and the number of
- * entries that were overwritten.
- */
-static inline unsigned long
-rb_num_of_entries(struct ring_buffer_per_cpu *cpu_buffer)
-{
- return local_read(&cpu_buffer->entries) -
- (local_read(&cpu_buffer->overrun) + cpu_buffer->read);
-}
-
/**
* ring_buffer_oldest_event_ts - get the oldest event timestamp from the buffer
* @buffer: The ring buffer
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 01/11] ring-buffer: Check for empty ring-buffer with rb_num_of_entries() Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-06 20:39 ` Steven Rostedt
2024-08-05 17:32 ` [RFC PATCH 03/11] ring-buffer: Expose buffer_data_page material Vincent Donnefort
` (9 subsequent siblings)
11 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
A ring-buffer writer is an entity outside of the kernel (most likely
firmware or a hypervisor) capable of writing events into a ring-buffer,
following the same format as the tracefs ring-buffer.
To set up the ring-buffer on the kernel side, a description of the pages
(struct trace_page_desc) is necessary. A callback (get_reader_page) must
also be provided; it is called whenever the kernel is done reading the
current reader page and needs a new one.
The writer is expected to keep the meta-page updated.
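For illustration, a per-CPU page description can then be walked with the
helpers introduced here (a sketch; the descriptor and the meta-page are
assumed to have already been filled by the writer):

  struct rb_page_desc *pdesc = rb_page_desc(writer->pdesc, cpu);
  struct trace_buffer_meta *meta;
  void *reader_va;

  if (!pdesc)
          return -EINVAL;

  meta = (struct trace_buffer_meta *)pdesc->meta_va;

  /* VA of the sub-buffer currently designated as the reader page */
  reader_va = rb_page_desc_page(pdesc, meta->reader.id);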
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 96d2140b471e..d3ea2b40437c 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -83,21 +83,24 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
void ring_buffer_discard_commit(struct trace_buffer *buffer,
struct ring_buffer_event *event);
+struct ring_buffer_writer;
+
/*
* size is in bytes for each per CPU buffer.
*/
struct trace_buffer *
-__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
+__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key,
+ struct ring_buffer_writer *writer);
/*
* Because the ring buffer is generic, if other users of the ring buffer get
* traced by ftrace, it can produce lockdep warnings. We need to keep each
* ring buffer's lock class separate.
*/
-#define ring_buffer_alloc(size, flags) \
-({ \
- static struct lock_class_key __key; \
- __ring_buffer_alloc((size), (flags), &__key); \
+#define ring_buffer_alloc(size, flags) \
+({ \
+ static struct lock_class_key __key; \
+ __ring_buffer_alloc((size), (flags), &__key, NULL); \
})
typedef bool (*ring_buffer_cond_fn)(void *data);
@@ -229,4 +232,70 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
struct vm_area_struct *vma);
int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
+
+#define meta_pages_lost(__meta) \
+ ((__meta)->Reserved1)
+#define meta_pages_touched(__meta) \
+ ((__meta)->Reserved2)
+
+struct rb_page_desc {
+ int cpu;
+ int nr_page_va; /* exclude the meta page */
+ unsigned long meta_va;
+ unsigned long page_va[];
+};
+
+struct trace_page_desc {
+ int nr_cpus;
+ char __data[]; /* list of rb_page_desc */
+};
+
+static inline
+struct rb_page_desc *__next_rb_page_desc(struct rb_page_desc *pdesc)
+{
+ size_t len = struct_size(pdesc, page_va, pdesc->nr_page_va);
+
+ return (struct rb_page_desc *)((void *)pdesc + len);
+}
+
+#define for_each_rb_page_desc(__pdesc, __cpu, __trace_pdesc) \
+ for (__pdesc = (struct rb_page_desc *)&((__trace_pdesc)->__data[0]), __cpu = 0; \
+ __cpu < (__trace_pdesc)->nr_cpus; \
+ __cpu++, __pdesc = __next_rb_page_desc(__pdesc))
+
+static inline
+struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc, int cpu)
+{
+ struct rb_page_desc *pdesc;
+ int i;
+
+ if (!trace_pdesc)
+ return NULL;
+
+ for_each_rb_page_desc(pdesc, i, trace_pdesc) {
+ if (pdesc->cpu == cpu)
+ return pdesc;
+ }
+
+ return NULL;
+}
+
+static inline
+void *rb_page_desc_page(struct rb_page_desc *pdesc, int page_id)
+{
+ return page_id > pdesc->nr_page_va ? NULL : (void *)pdesc->page_va[page_id];
+}
+
+struct ring_buffer_writer {
+ struct trace_page_desc *pdesc;
+ int (*get_reader_page)(int cpu);
+};
+
+int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu);
+
+#define ring_buffer_reader(writer) \
+({ \
+ static struct lock_class_key __key; \
+ __ring_buffer_alloc(0, RB_FL_OVERWRITE, &__key, writer);\
+})
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4f4dda28077..a07c22254cfd 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -495,6 +495,8 @@ struct ring_buffer_per_cpu {
unsigned long *subbuf_ids; /* ID to subbuf VA */
struct trace_buffer_meta *meta_page;
+ struct ring_buffer_writer *writer;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
long nr_pages_to_update;
struct list_head new_pages; /* new pages to add */
@@ -517,6 +519,8 @@ struct trace_buffer {
struct ring_buffer_per_cpu **buffers;
+ struct ring_buffer_writer *writer;
+
struct hlist_node node;
u64 (*clock)(void);
@@ -1626,6 +1630,31 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
cpu_buffer->reader_page = bpage;
+ if (buffer->writer) {
+ struct rb_page_desc *pdesc = rb_page_desc(buffer->writer->pdesc, cpu);
+
+ if (!pdesc)
+ goto fail_free_reader;
+
+ cpu_buffer->writer = buffer->writer;
+ cpu_buffer->meta_page = (struct trace_buffer_meta *)(void *)pdesc->meta_va;
+ cpu_buffer->subbuf_ids = pdesc->page_va;
+ cpu_buffer->nr_pages = pdesc->nr_page_va - 1;
+ atomic_inc(&cpu_buffer->record_disabled);
+ atomic_inc(&cpu_buffer->resize_disabled);
+
+ bpage->page = rb_page_desc_page(pdesc,
+ cpu_buffer->meta_page->reader.id);
+ if (!bpage->page)
+ goto fail_free_reader;
+ /*
+ * The meta-page can only describe which of the ring-buffer page
+ * is the reader. There is no need to init the rest of the
+ * ring-buffer.
+ */
+ return cpu_buffer;
+ }
+
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
@@ -1663,6 +1692,10 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
irq_work_sync(&cpu_buffer->irq_work.work);
+ if (cpu_buffer->writer)
+ /* the ring_buffer doesn't own the data pages */
+ cpu_buffer->reader_page->page = NULL;
+
free_buffer_page(cpu_buffer->reader_page);
if (head) {
@@ -1693,7 +1726,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
* drop data when the tail hits the head.
*/
struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
- struct lock_class_key *key)
+ struct lock_class_key *key,
+ struct ring_buffer_writer *writer)
{
struct trace_buffer *buffer;
long nr_pages;
@@ -1721,6 +1755,10 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
+ if (writer) {
+ buffer->writer = writer;
+ atomic_inc(&buffer->record_disabled);
+ }
init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&buffer->irq_work.waiters);
@@ -4468,8 +4506,54 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
}
}
+static bool rb_read_writer_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ local_set(&cpu_buffer->entries, READ_ONCE(cpu_buffer->meta_page->entries));
+ local_set(&cpu_buffer->overrun, READ_ONCE(cpu_buffer->meta_page->overrun));
+ local_set(&cpu_buffer->pages_touched, READ_ONCE(meta_pages_touched(cpu_buffer->meta_page)));
+ local_set(&cpu_buffer->pages_lost, READ_ONCE(meta_pages_lost(cpu_buffer->meta_page)));
+ /*
+ * No need to get the "read" field, it can be tracked here as any
+ * reader will have to go through a ring_buffer_per_cpu.
+ */
+
+ return rb_num_of_entries(cpu_buffer);
+}
+
+static struct buffer_page *
+__rb_get_reader_page_from_writer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ u32 prev_reader;
+
+ if (!rb_read_writer_meta_page(cpu_buffer))
+ return NULL;
+
+ /* More to read on the reader page */
+ if (cpu_buffer->reader_page->read < rb_page_size(cpu_buffer->reader_page))
+ return cpu_buffer->reader_page;
+
+ prev_reader = cpu_buffer->meta_page->reader.id;
+
+ WARN_ON(cpu_buffer->writer->get_reader_page(cpu_buffer->cpu));
+ /* nr_pages doesn't include the reader page */
+ if (cpu_buffer->meta_page->reader.id > cpu_buffer->nr_pages) {
+ WARN_ON(1);
+ return NULL;
+ }
+
+ cpu_buffer->reader_page->page =
+ (void *)cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+ cpu_buffer->reader_page->read = 0;
+ cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+ cpu_buffer->lost_events = cpu_buffer->meta_page->reader.lost_events;
+
+ WARN_ON(prev_reader == cpu_buffer->meta_page->reader.id);
+
+ return cpu_buffer->reader_page;
+}
+
static struct buffer_page *
-rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
{
struct buffer_page *reader = NULL;
unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
@@ -4636,6 +4720,13 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
return reader;
}
+static struct buffer_page *
+rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ return cpu_buffer->writer ? __rb_get_reader_page_from_writer(cpu_buffer) :
+ __rb_get_reader_page(cpu_buffer);
+}
+
static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
{
struct ring_buffer_event *event;
@@ -5040,7 +5131,7 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
struct ring_buffer_per_cpu *cpu_buffer;
struct ring_buffer_iter *iter;
- if (!cpumask_test_cpu(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
return NULL;
iter = kzalloc(sizeof(*iter), flags);
@@ -5210,6 +5301,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
{
struct buffer_page *page;
+ if (cpu_buffer->writer)
+ return;
+
rb_head_page_deactivate(cpu_buffer);
cpu_buffer->head_page
@@ -5440,6 +5534,49 @@ bool ring_buffer_empty_cpu(struct trace_buffer *buffer, int cpu)
}
EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
+int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ unsigned long flags;
+
+ if (cpu != RING_BUFFER_ALL_CPUS) {
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
+ return -EINVAL;
+
+ cpu_buffer = buffer->buffers[cpu];
+
+ raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+ if (rb_read_writer_meta_page(cpu_buffer))
+ rb_wakeups(buffer, cpu_buffer);
+ raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+
+ return 0;
+ }
+
+ /*
+ * Make sure all the ring buffers are up to date before we start reading
+ * them.
+ */
+ for_each_buffer_cpu(buffer, cpu) {
+ cpu_buffer = buffer->buffers[cpu];
+
+ raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+ rb_read_writer_meta_page(buffer->buffers[cpu]);
+ raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+ }
+
+ for_each_buffer_cpu(buffer, cpu) {
+ cpu_buffer = buffer->buffers[cpu];
+
+ raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+ if (rb_num_of_entries(cpu_buffer))
+ rb_wakeups(buffer, buffer->buffers[cpu]);
+ raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+ }
+
+ return 0;
+}
+
#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
/**
* ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
@@ -5691,6 +5828,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
unsigned int commit;
unsigned int read;
u64 save_timestamp;
+ bool force_memcpy;
int ret = -1;
if (!cpumask_test_cpu(cpu, buffer->cpumask))
@@ -5728,6 +5866,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
+ force_memcpy = cpu_buffer->mapped || cpu_buffer->writer;
+
/*
* If this page has been partially read or
* if len is not big enough to read the rest of the page or
@@ -5737,7 +5877,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
*/
if (read || (len < (commit - read)) ||
cpu_buffer->reader_page == cpu_buffer->commit_page ||
- cpu_buffer->mapped) {
+ force_memcpy) {
struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
unsigned int rpos = read;
unsigned int pos = 0;
@@ -6290,7 +6430,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
unsigned long flags, *subbuf_ids;
int err = 0;
- if (!cpumask_test_cpu(cpu, buffer->cpumask))
+ if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
return -EINVAL;
cpu_buffer = buffer->buffers[cpu];
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 03/11] ring-buffer: Expose buffer_data_page material
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 01/11] ring-buffer: Check for empty ring-buffer with rb_num_of_entries() Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots Vincent Donnefort
` (8 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
In preparation for allowing ring-buffer compliant pages to be written
outside of ring_buffer.c, move struct buffer_data_page and the timestamp
encoding functions into the publicly available ring_buffer.h.
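An out-of-ring_buffer.c writer can then, for instance, reuse the timestamp
encoding to decide when a time-extend event is needed (a sketch; ts and
prev_ts stand for the current and previous event timestamps):

  u64 delta = ts - prev_ts;

  if (test_time_stamp(delta)) {
          /* delta doesn't fit in 27 bits: emit a RINGBUF_TYPE_TIME_EXTEND */
  }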
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index d3ea2b40437c..cb2a266d1f2e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -3,8 +3,10 @@
#define _LINUX_RING_BUFFER_H
#include <linux/mm.h>
-#include <linux/seq_file.h>
#include <linux/poll.h>
+#include <linux/seq_file.h>
+
+#include <asm/local.h>
#include <uapi/linux/trace_mmap.h>
@@ -20,6 +22,8 @@ struct ring_buffer_event {
u32 array[];
};
+#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
+
/**
* enum ring_buffer_type - internal ring buffer types
*
@@ -61,11 +65,50 @@ enum ring_buffer_type {
RINGBUF_TYPE_TIME_STAMP,
};
+#define TS_SHIFT 27
+#define TS_MASK ((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST (~TS_MASK)
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline bool test_time_stamp(u64 delta)
+{
+ return !!(delta & TS_DELTA_TEST);
+}
+
unsigned ring_buffer_event_length(struct ring_buffer_event *event);
void *ring_buffer_event_data(struct ring_buffer_event *event);
u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
struct ring_buffer_event *event);
+#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
+
+/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
+#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
+
+#define RB_ALIGNMENT 4U
+#define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
+#define RB_EVNT_MIN_SIZE 8U /* two 32bit words */
+
+#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
+# define RB_FORCE_8BYTE_ALIGNMENT 0
+# define RB_ARCH_ALIGNMENT RB_ALIGNMENT
+#else
+# define RB_FORCE_8BYTE_ALIGNMENT 1
+# define RB_ARCH_ALIGNMENT 8U
+#endif
+
+#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT)
+
+struct buffer_data_page {
+ u64 time_stamp; /* page time stamp */
+ local_t commit; /* write committed index */
+ unsigned char data[] RB_ALIGN_DATA; /* data of buffer page */
+};
+
/*
* ring_buffer_discard_commit will remove an event that has not
* been committed yet. If this is used, then ring_buffer_unlock_commit
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a07c22254cfd..6b00b9a68b7a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -135,23 +135,6 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
/* Used for individual buffers (after the counter) */
#define RB_BUFFER_OFF (1 << 20)
-#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
-
-#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
-#define RB_ALIGNMENT 4U
-#define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
-#define RB_EVNT_MIN_SIZE 8U /* two 32bit words */
-
-#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
-# define RB_FORCE_8BYTE_ALIGNMENT 0
-# define RB_ARCH_ALIGNMENT RB_ALIGNMENT
-#else
-# define RB_FORCE_8BYTE_ALIGNMENT 1
-# define RB_ARCH_ALIGNMENT 8U
-#endif
-
-#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT)
-
/* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */
#define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX
@@ -294,10 +277,6 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);
#define for_each_online_buffer_cpu(buffer, cpu) \
for_each_cpu_and(cpu, buffer->cpumask, cpu_online_mask)
-#define TS_SHIFT 27
-#define TS_MASK ((1ULL << TS_SHIFT) - 1)
-#define TS_DELTA_TEST (~TS_MASK)
-
static u64 rb_event_time_stamp(struct ring_buffer_event *event)
{
u64 ts;
@@ -316,12 +295,6 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event)
#define RB_MISSED_MASK (3 << 30)
-struct buffer_data_page {
- u64 time_stamp; /* page time stamp */
- local_t commit; /* write committed index */
- unsigned char data[] RB_ALIGN_DATA; /* data of buffer page */
-};
-
struct buffer_data_read_page {
unsigned order; /* order of the page */
struct buffer_data_page *data; /* actual data, stored in this page */
@@ -377,14 +350,6 @@ static void free_buffer_page(struct buffer_page *bpage)
kfree(bpage);
}
-/*
- * We need to fit the time_stamp delta into 27 bits.
- */
-static inline bool test_time_stamp(u64 delta)
-{
- return !!(delta & TS_DELTA_TEST);
-}
-
struct rb_irq_work {
struct irq_work work;
wait_queue_head_t waiters;
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (2 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 03/11] ring-buffer: Expose buffer_data_page material Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-22 9:13 ` Marc Zyngier
2024-08-22 18:13 ` John Stultz
2024-08-05 17:32 ` [RFC PATCH 05/11] KVM: arm64: Support unaligned fixmap in the nVHE hyp Vincent Donnefort
` (7 subsequent siblings)
11 siblings, 2 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
On arm64 systems, the arch timer is accessible to both EL1 and EL2. This
means that when running with nVHE or protected KVM, the hypervisor can
easily generate clock values that are synchronized with the kernel.
For tracing purposes, the boot clock is interesting as it doesn't stop on
suspend. Export it as part of the time snapshot. This will later allow
the hypervisor to add boot clock timestamps to its events.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fc12a9ba2c88..0fc6a61d64bd 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -275,18 +275,24 @@ struct ktime_timestamps {
* counter value
* @cycles: Clocksource counter value to produce the system times
* @real: Realtime system time
+ * @boot: Boot time
* @raw: Monotonic raw system time
* @cs_id: Clocksource ID
* @clock_was_set_seq: The sequence number of clock-was-set events
* @cs_was_changed_seq: The sequence number of clocksource change events
+ * @mono_shift: The monotonic clock slope shift
+ * @mono_mult: The monotonic clock slope mult
*/
struct system_time_snapshot {
u64 cycles;
ktime_t real;
+ ktime_t boot;
ktime_t raw;
enum clocksource_ids cs_id;
unsigned int clock_was_set_seq;
u8 cs_was_changed_seq;
+ u32 mono_shift;
+ u32 mono_mult;
};
/**
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 2fa87dcfeda9..6d0488a555a7 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1057,9 +1057,11 @@ noinstr time64_t __ktime_get_real_seconds(void)
void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
{
struct timekeeper *tk = &tk_core.timekeeper;
+ u32 mono_mult, mono_shift;
unsigned int seq;
ktime_t base_raw;
ktime_t base_real;
+ ktime_t base_boot;
u64 nsec_raw;
u64 nsec_real;
u64 now;
@@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
base_real = ktime_add(tk->tkr_mono.base,
tk_core.timekeeper.offs_real);
+ base_boot = ktime_add(tk->tkr_mono.base,
+ tk_core.timekeeper.offs_boot);
base_raw = tk->tkr_raw.base;
nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
+ mono_mult = tk->tkr_mono.mult;
+ mono_shift = tk->tkr_mono.shift;
} while (read_seqcount_retry(&tk_core.seq, seq));
systime_snapshot->cycles = now;
systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
+ systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
+ systime_snapshot->mono_shift = mono_shift;
+ systime_snapshot->mono_mult = mono_mult;
}
EXPORT_SYMBOL_GPL(ktime_get_snapshot);
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 05/11] KVM: arm64: Support unaligned fixmap in the nVHE hyp
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (3 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 06/11] KVM: arm64: Add clock support " Vincent Donnefort
` (6 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
Return the fixmap VA with the page offset, instead of the page base
address.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index 8850b591d775..c5a9d8874eb2 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -240,7 +240,7 @@ void *hyp_fixmap_map(phys_addr_t phys)
WRITE_ONCE(*ptep, pte);
dsb(ishst);
- return (void *)slot->addr;
+ return (void *)slot->addr + offset_in_page(phys);
}
static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 06/11] KVM: arm64: Add clock support in the nVHE hyp
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (4 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 05/11] KVM: arm64: Support unaligned fixmap in the nVHE hyp Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 07/11] KVM: arm64: Add tracing support for the pKVM hyp Vincent Donnefort
` (5 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
By default, the arm64 host kernel uses the arch timer as the source for
sched_clock. Conveniently, EL2 has access to that same counter, allowing
it to generate clock values that are synchronized with the kernel.
The clock nonetheless needs to be set up with the same slope values as
the kernel's. Introduce at the same time trace_clock(), which is expected
to be configured later by the hypervisor tracing.
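For instance, the host could derive those values from a time snapshot (a
sketch; the actual plumbing toward the hypervisor is introduced later in
the series):

  struct system_time_snapshot snap;
  struct kvm_nvhe_clock_data clock_data;

  ktime_get_snapshot(&snap);

  clock_data.mult      = snap.mono_mult;
  clock_data.shift     = snap.mono_shift;
  clock_data.epoch_cyc = snap.cycles;
  clock_data.epoch_ns  = ktime_to_ns(snap.boot);

  /* clock_data is then shared with the hypervisor and fed to trace_clock_update() */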
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index c838309e4ec4..9105aba0ad00 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -145,4 +145,10 @@ extern unsigned long kvm_nvhe_sym(__icache_flags);
extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits);
extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl);
+struct kvm_nvhe_clock_data {
+ u32 mult;
+ u32 shift;
+ u64 epoch_ns;
+ u64 epoch_cyc;
+};
#endif /* __ARM64_KVM_HYP_H__ */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/clock.h b/arch/arm64/kvm/hyp/include/nvhe/clock.h
new file mode 100644
index 000000000000..7e5c2d2b1886
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/clock.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARM64_KVM_HYP_NVHE_CLOCK_H
+#define __ARM64_KVM_HYP_NVHE_CLOCK_H
+#include <linux/types.h>
+
+#include <asm/kvm_hyp.h>
+
+#ifdef CONFIG_TRACING
+void trace_clock_update(struct kvm_nvhe_clock_data *data);
+u64 trace_clock(void);
+#else
+static inline void trace_clock_update(struct kvm_nvhe_clock_data *data) { }
+static inline u64 trace_clock(void) { return 0; }
+#endif
+#endif
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 782b34b004be..fde8ac1abe7b 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -26,6 +26,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
+hyp-obj-$(CONFIG_TRACING) += clock.o
hyp-obj-y += $(lib-objs)
##
diff --git a/arch/arm64/kvm/hyp/nvhe/clock.c b/arch/arm64/kvm/hyp/nvhe/clock.c
new file mode 100644
index 000000000000..4ff87e86787c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/clock.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <nvhe/clock.h>
+
+#include <asm/arch_timer.h>
+#include <asm/div64.h>
+
+static struct kvm_nvhe_clock_data trace_clock_data;
+
+/*
+ * Update without any locks! This is fine because tracing, the sole user of this
+ * clock is ordering the memory and protects from races between read and
+ * updates.
+ */
+void trace_clock_update(struct kvm_nvhe_clock_data *data)
+{
+ trace_clock_data.mult = data->mult;
+ trace_clock_data.shift = data->shift;
+ trace_clock_data.epoch_ns = data->epoch_ns;
+ trace_clock_data.epoch_cyc = data->epoch_cyc;
+}
+
+/*
+ * This clock is relying on host provided slope and epoch values to return
+ * something synchronized with the host. The downside is we can't trust the
+ * output which must not be used for anything else than debugging.
+ */
+u64 trace_clock(void)
+{
+ u64 cyc = __arch_counter_get_cntpct() - trace_clock_data.epoch_cyc;
+ __uint128_t ns;
+
+ /*
+ * The host kernel can avoid the 64-bits overflow of the multiplication
+ * by updating the epoch value with a timer (see
+ * kernel/time/clocksource.c). The hypervisor doesn't have that option,
+ * so let's do a more costly 128-bits mult here.
+ */
+ ns = (__uint128_t)cyc * trace_clock_data.mult;
+ ns >>= trace_clock_data.shift;
+
+ return (u64)ns + trace_clock_data.epoch_ns;
+}
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 07/11] KVM: arm64: Add tracing support for the pKVM hyp
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (5 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 06/11] KVM: arm64: Add clock support " Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 08/11] KVM: arm64: Add hyp tracing to tracefs Vincent Donnefort
` (4 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
When running in protected mode, the host has very little knowledge about
what is happening in the hypervisor. Of course this is an essential
property for security but, as that piece of code grows with more
responsibilities, we now need a way to debug and profile it. Tracefs, with
its reliability, versatility and user-space support, is the perfect tool.
There's no way the hypervisor can log events directly into the host
tracefs ring-buffers. So instead, let's use our own, where the hypervisor
is the writer and the host the reader.
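As a sketch of the writing side introduced here (the payload layout is
hypothetical; a later patch wraps this into per-event helpers):

  struct my_event {
          u64 id;
  };

  struct my_event *entry = tracing_reserve_entry(sizeof(*entry));

  if (entry) {
          entry->id = 42;
          tracing_commit_entry();
  }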
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 2181a11b9d92..d549d7d491c3 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -79,6 +79,10 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+ __KVM_HOST_SMCCC_FUNC___pkvm_load_tracing,
+ __KVM_HOST_SMCCC_FUNC___pkvm_teardown_tracing,
+ __KVM_HOST_SMCCC_FUNC___pkvm_enable_tracing,
+ __KVM_HOST_SMCCC_FUNC___pkvm_swap_reader_tracing,
};
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
diff --git a/arch/arm64/include/asm/kvm_hyptrace.h b/arch/arm64/include/asm/kvm_hyptrace.h
new file mode 100644
index 000000000000..b371776da0fe
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_hyptrace.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ARM64_KVM_HYPTRACE_H_
+#define __ARM64_KVM_HYPTRACE_H_
+#include <asm/kvm_hyp.h>
+
+#include <linux/ring_buffer.h>
+
+/*
+ * Host donations to the hypervisor to store the struct hyp_buffer_page.
+ */
+struct hyp_buffer_pages_backing {
+ unsigned long start;
+ size_t size;
+};
+
+struct hyp_trace_desc {
+ struct hyp_buffer_pages_backing backing;
+ struct kvm_nvhe_clock_data clock_data;
+ struct trace_page_desc page_desc;
+
+};
+#endif
diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h
new file mode 100644
index 000000000000..29e0c7a6c867
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ARM64_KVM_HYP_NVHE_TRACE_H
+#define __ARM64_KVM_HYP_NVHE_TRACE_H
+#include <asm/kvm_hyptrace.h>
+
+/* Internal struct that needs export for hyp-constants.c */
+struct hyp_buffer_page {
+ struct list_head list;
+ struct buffer_data_page *page;
+ unsigned long write;
+ unsigned long entries;
+ u32 id;
+};
+
+#ifdef CONFIG_TRACING
+void *tracing_reserve_entry(unsigned long length);
+void tracing_commit_entry(void);
+
+int __pkvm_load_tracing(unsigned long desc_va, size_t desc_size);
+void __pkvm_teardown_tracing(void);
+int __pkvm_enable_tracing(bool enable);
+int __pkvm_swap_reader_tracing(int cpu);
+#else
+static inline void *tracing_reserve_entry(unsigned long length) { return NULL; }
+static inline void tracing_commit_entry(void) { }
+
+static inline int __pkvm_load_tracing(unsigned long desc_va, size_t desc_size) { return -ENODEV; }
+static inline void __pkvm_teardown_tracing(void) { }
+static inline int __pkvm_enable_tracing(bool enable) { return -ENODEV; }
+static inline int __pkvm_swap_reader_tracing(int cpu) { return -ENODEV; }
+#endif
+#endif
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index fde8ac1abe7b..1a5367c13407 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -26,7 +26,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
-hyp-obj-$(CONFIG_TRACING) += clock.o
+hyp-obj-$(CONFIG_TRACING) += clock.o trace.o
hyp-obj-y += $(lib-objs)
##
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index f43d845f3c4e..d3d01c02d35c 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -17,6 +17,7 @@
#include <nvhe/mem_protect.h>
#include <nvhe/mm.h>
#include <nvhe/pkvm.h>
+#include <nvhe/trace.h>
#include <nvhe/trap_handler.h>
DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
@@ -373,6 +374,35 @@ static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle);
}
+static void handle___pkvm_load_tracing(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(unsigned long, desc_hva, host_ctxt, 1);
+ DECLARE_REG(size_t, desc_size, host_ctxt, 2);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_load_tracing(desc_hva, desc_size);
+}
+
+static void handle___pkvm_teardown_tracing(struct kvm_cpu_context *host_ctxt)
+{
+ __pkvm_teardown_tracing();
+
+ cpu_reg(host_ctxt, 1) = 0;
+}
+
+static void handle___pkvm_enable_tracing(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(bool, enable, host_ctxt, 1);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_enable_tracing(enable);
+}
+
+static void handle___pkvm_swap_reader_tracing(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(int, cpu, host_ctxt, 1);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_swap_reader_tracing(cpu);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -405,6 +435,10 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_init_vm),
HANDLE_FUNC(__pkvm_init_vcpu),
HANDLE_FUNC(__pkvm_teardown_vm),
+ HANDLE_FUNC(__pkvm_load_tracing),
+ HANDLE_FUNC(__pkvm_teardown_tracing),
+ HANDLE_FUNC(__pkvm_enable_tracing),
+ HANDLE_FUNC(__pkvm_swap_reader_tracing),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/trace.c b/arch/arm64/kvm/hyp/nvhe/trace.c
new file mode 100644
index 000000000000..7bf3af5fe725
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/trace.c
@@ -0,0 +1,594 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <nvhe/clock.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+#include <nvhe/trace.h>
+
+#include <asm/percpu.h>
+#include <asm/kvm_mmu.h>
+#include <asm/local.h>
+
+#define HYP_RB_PAGE_HEAD 1UL
+#define HYP_RB_PAGE_UPDATE 2UL
+#define HYP_RB_FLAG_MASK 3UL
+
+struct hyp_rb_per_cpu {
+ struct trace_buffer_meta *meta;
+ struct hyp_buffer_page *tail_page;
+ struct hyp_buffer_page *reader_page;
+ struct hyp_buffer_page *head_page;
+ struct hyp_buffer_page *bpages;
+ unsigned long nr_pages;
+ unsigned long last_overrun;
+ u64 write_stamp;
+ atomic_t status;
+};
+
+#define HYP_RB_UNAVAILABLE 0
+#define HYP_RB_READY 1
+#define HYP_RB_WRITING 2
+
+static struct hyp_buffer_pages_backing hyp_buffer_pages_backing;
+DEFINE_PER_CPU(struct hyp_rb_per_cpu, trace_rb);
+DEFINE_HYP_SPINLOCK(trace_rb_lock);
+
+static bool rb_set_flag(struct hyp_buffer_page *bpage, int new_flag)
+{
+ unsigned long ret, val = (unsigned long)bpage->list.next;
+
+ ret = cmpxchg((unsigned long *)&bpage->list.next,
+ val, (val & ~HYP_RB_FLAG_MASK) | new_flag);
+
+ return ret == val;
+}
+
+static struct hyp_buffer_page *rb_hyp_buffer_page(struct list_head *list)
+{
+ unsigned long ptr = (unsigned long)list & ~HYP_RB_FLAG_MASK;
+
+ return container_of((struct list_head *)ptr, struct hyp_buffer_page, list);
+}
+
+static struct hyp_buffer_page *rb_next_page(struct hyp_buffer_page *bpage)
+{
+ return rb_hyp_buffer_page(bpage->list.next);
+}
+
+static bool rb_is_head_page(struct hyp_buffer_page *bpage)
+{
+ return (unsigned long)bpage->list.prev->next & HYP_RB_PAGE_HEAD;
+}
+
+static struct hyp_buffer_page *rb_set_head_page(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ struct hyp_buffer_page *bpage, *prev_head;
+ int cnt = 0;
+
+again:
+ bpage = prev_head = cpu_buffer->head_page;
+ do {
+ if (rb_is_head_page(bpage)) {
+ cpu_buffer->head_page = bpage;
+ return bpage;
+ }
+
+ bpage = rb_next_page(bpage);
+ } while (bpage != prev_head);
+
+ /* We might have race with the writer let's try again */
+ if (++cnt < 3)
+ goto again;
+
+ return NULL;
+}
+
+static int rb_swap_reader_page(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ unsigned long *old_head_link, old_link_val, new_link_val, overrun;
+ struct hyp_buffer_page *head, *reader = cpu_buffer->reader_page;
+
+spin:
+ /* Update the cpu_buffer->header_page according to HYP_RB_PAGE_HEAD */
+ head = rb_set_head_page(cpu_buffer);
+ if (!head)
+ return -ENODEV;
+
+ /* Connect the reader page around the header page */
+ reader->list.next = head->list.next;
+ reader->list.prev = head->list.prev;
+
+ /* The reader page points to the new header page */
+ rb_set_flag(reader, HYP_RB_PAGE_HEAD);
+
+ /*
+ * Paired with the cmpxchg in rb_move_tail(). Order the read of the head
+ * page and overrun.
+ */
+ smp_mb();
+ overrun = READ_ONCE(cpu_buffer->meta->overrun);
+
+ /* Try to swap the prev head link to the reader page */
+ old_head_link = (unsigned long *)&reader->list.prev->next;
+ old_link_val = (*old_head_link & ~HYP_RB_FLAG_MASK) | HYP_RB_PAGE_HEAD;
+ new_link_val = (unsigned long)&reader->list;
+ if (cmpxchg(old_head_link, old_link_val, new_link_val)
+ != old_link_val)
+ goto spin;
+
+ cpu_buffer->head_page = rb_hyp_buffer_page(reader->list.next);
+ cpu_buffer->head_page->list.prev = &reader->list;
+ cpu_buffer->reader_page = head;
+ cpu_buffer->meta->reader.lost_events = overrun - cpu_buffer->last_overrun;
+ cpu_buffer->meta->reader.id = cpu_buffer->reader_page->id;
+ cpu_buffer->last_overrun = overrun;
+
+ return 0;
+}
+
+static struct hyp_buffer_page *
+rb_move_tail(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ struct hyp_buffer_page *tail_page, *new_tail, *new_head;
+
+ tail_page = cpu_buffer->tail_page;
+ new_tail = rb_next_page(tail_page);
+
+again:
+ /*
+ * We caught the reader ... Let's try to move the head page.
+ * The writer can only rely on ->next links to check if this is head.
+ */
+ if ((unsigned long)tail_page->list.next & HYP_RB_PAGE_HEAD) {
+ /* The reader moved the head in between */
+ if (!rb_set_flag(tail_page, HYP_RB_PAGE_UPDATE))
+ goto again;
+
+ WRITE_ONCE(cpu_buffer->meta->overrun,
+ cpu_buffer->meta->overrun + new_tail->entries);
+ WRITE_ONCE(meta_pages_lost(cpu_buffer->meta),
+ meta_pages_lost(cpu_buffer->meta) + 1);
+
+ /* Move the head */
+ rb_set_flag(new_tail, HYP_RB_PAGE_HEAD);
+
+ /* The new head is in place, reset the update flag */
+ rb_set_flag(tail_page, 0);
+
+ new_head = rb_next_page(new_tail);
+ }
+
+ local_set(&new_tail->page->commit, 0);
+
+ new_tail->write = 0;
+ new_tail->entries = 0;
+
+ WRITE_ONCE(meta_pages_touched(cpu_buffer->meta),
+ meta_pages_touched(cpu_buffer->meta) + 1);
+ cpu_buffer->tail_page = new_tail;
+
+ return new_tail;
+}
+
+static unsigned long rb_event_size(unsigned long length)
+{
+ struct ring_buffer_event *event;
+
+ return length + RB_EVNT_HDR_SIZE + sizeof(event->array[0]);
+}
+
+static struct ring_buffer_event *
+rb_add_ts_extend(struct ring_buffer_event *event, u64 delta)
+{
+ event->type_len = RINGBUF_TYPE_TIME_EXTEND;
+ event->time_delta = delta & TS_MASK;
+ event->array[0] = delta >> TS_SHIFT;
+
+ return (struct ring_buffer_event *)((unsigned long)event + 8);
+}
+
+static struct ring_buffer_event *
+rb_reserve_next(struct hyp_rb_per_cpu *cpu_buffer, unsigned long length)
+{
+ unsigned long ts_ext_size = 0, event_size = rb_event_size(length);
+ struct hyp_buffer_page *tail_page = cpu_buffer->tail_page;
+ struct ring_buffer_event *event;
+ unsigned long write, prev_write;
+ u64 ts, time_delta;
+
+ ts = trace_clock();
+
+ time_delta = ts - cpu_buffer->write_stamp;
+
+ if (test_time_stamp(time_delta))
+ ts_ext_size = 8;
+
+ prev_write = tail_page->write;
+ write = prev_write + event_size + ts_ext_size;
+
+ if (unlikely(write > BUF_PAGE_SIZE))
+ tail_page = rb_move_tail(cpu_buffer);
+
+ if (!tail_page->entries) {
+ tail_page->page->time_stamp = ts;
+ time_delta = 0;
+ ts_ext_size = 0;
+ write = event_size;
+ prev_write = 0;
+ }
+
+ tail_page->write = write;
+ tail_page->entries++;
+
+ cpu_buffer->write_stamp = ts;
+
+ event = (struct ring_buffer_event *)(tail_page->page->data +
+ prev_write);
+ if (ts_ext_size) {
+ event = rb_add_ts_extend(event, time_delta);
+ time_delta = 0;
+ }
+
+ event->type_len = 0;
+ event->time_delta = time_delta;
+ event->array[0] = event_size - RB_EVNT_HDR_SIZE;
+
+ return event;
+}
+
+void *tracing_reserve_entry(unsigned long length)
+{
+ struct hyp_rb_per_cpu *cpu_buffer = this_cpu_ptr(&trace_rb);
+ struct ring_buffer_event *rb_event;
+
+ if (atomic_cmpxchg(&cpu_buffer->status, HYP_RB_READY, HYP_RB_WRITING)
+ == HYP_RB_UNAVAILABLE)
+ return NULL;
+
+ rb_event = rb_reserve_next(cpu_buffer, length);
+
+ return &rb_event->array[1];
+}
+
+void tracing_commit_entry(void)
+{
+ struct hyp_rb_per_cpu *cpu_buffer = this_cpu_ptr(&trace_rb);
+
+ local_set(&cpu_buffer->tail_page->page->commit,
+ cpu_buffer->tail_page->write);
+ WRITE_ONCE(cpu_buffer->meta->entries,
+ cpu_buffer->meta->entries + 1);
+
+ /* Paired with rb_cpu_disable_writing() */
+ atomic_set_release(&cpu_buffer->status, HYP_RB_READY);
+}
+
+static int rb_page_init(struct hyp_buffer_page *bpage, unsigned long hva)
+{
+ void *hyp_va = (void *)kern_hyp_va(hva);
+ int ret;
+
+ ret = hyp_pin_shared_mem(hyp_va, hyp_va + PAGE_SIZE);
+ if (ret)
+ return ret;
+
+ INIT_LIST_HEAD(&bpage->list);
+ bpage->page = (struct buffer_data_page *)hyp_va;
+
+ local_set(&bpage->page->commit, 0);
+
+ return 0;
+}
+
+static bool rb_cpu_loaded(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ return !!cpu_buffer->bpages;
+}
+
+static void rb_cpu_disable_writing(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ int prev_status;
+
+ /* Wait for the buffer to be released */
+ do {
+ prev_status = atomic_cmpxchg_acquire(&cpu_buffer->status,
+ HYP_RB_READY,
+ HYP_RB_UNAVAILABLE);
+ } while (prev_status == HYP_RB_WRITING);
+}
+
+static int rb_cpu_enable_writing(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ if (!rb_cpu_loaded(cpu_buffer))
+ return -ENODEV;
+
+ atomic_cmpxchg(&cpu_buffer->status, HYP_RB_UNAVAILABLE, HYP_RB_READY);
+
+ return 0;
+}
+
+static void rb_cpu_teardown(struct hyp_rb_per_cpu *cpu_buffer)
+{
+ int i;
+
+ if (!rb_cpu_loaded(cpu_buffer))
+ return;
+
+ rb_cpu_disable_writing(cpu_buffer);
+
+ hyp_unpin_shared_mem((void *)cpu_buffer->meta,
+ (void *)(cpu_buffer->meta) + PAGE_SIZE);
+
+ for (i = 0; i < cpu_buffer->nr_pages; i++) {
+ struct hyp_buffer_page *bpage = &cpu_buffer->bpages[i];
+
+ if (!bpage->page)
+ continue;
+
+ hyp_unpin_shared_mem((void *)bpage->page,
+ (void *)bpage->page + PAGE_SIZE);
+ }
+
+ cpu_buffer->bpages = 0;
+}
+
+static bool rb_cpu_fits_backing(unsigned long nr_pages,
+ struct hyp_buffer_page *start)
+{
+ unsigned long max = hyp_buffer_pages_backing.start +
+ hyp_buffer_pages_backing.size;
+ struct hyp_buffer_page *end = start + nr_pages;
+
+ return (unsigned long)end <= max;
+}
+
+static bool rb_cpu_fits_desc(struct rb_page_desc *pdesc,
+ unsigned long desc_end)
+{
+ unsigned long *end;
+
+ /* Check we can at least read nr_pages */
+ if ((unsigned long)&pdesc->nr_page_va >= desc_end)
+ return false;
+
+ end = &pdesc->page_va[pdesc->nr_page_va];
+
+ return (unsigned long)end <= desc_end;
+}
+
+static int rb_cpu_init(struct rb_page_desc *pdesc, struct hyp_buffer_page *start,
+ struct hyp_rb_per_cpu *cpu_buffer)
+{
+ struct hyp_buffer_page *bpage = start;
+ int i, ret;
+
+ /* At least 1 reader page and one head */
+ if (pdesc->nr_page_va < 2)
+ return -EINVAL;
+
+ if (!rb_cpu_fits_backing(pdesc->nr_page_va, start))
+ return -EINVAL;
+
+ if (rb_cpu_loaded(cpu_buffer))
+ return -EBUSY;
+
+ cpu_buffer->bpages = start;
+
+ cpu_buffer->meta = (struct trace_buffer_meta *)kern_hyp_va(pdesc->meta_va);
+ ret = hyp_pin_shared_mem((void *)cpu_buffer->meta,
+ ((void *)cpu_buffer->meta) + PAGE_SIZE);
+ if (ret)
+ return ret;
+
+ memset(cpu_buffer->meta, 0, sizeof(*cpu_buffer->meta));
+ cpu_buffer->meta->meta_page_size = PAGE_SIZE;
+ cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages;
+
+ /* The reader page is not part of the ring initially */
+ ret = rb_page_init(bpage, pdesc->page_va[0]);
+ if (ret)
+ goto err;
+
+ cpu_buffer->nr_pages = 1;
+
+ cpu_buffer->reader_page = bpage;
+ cpu_buffer->tail_page = bpage + 1;
+ cpu_buffer->head_page = bpage + 1;
+
+ for (i = 1; i < pdesc->nr_page_va; i++) {
+ ret = rb_page_init(++bpage, pdesc->page_va[i]);
+ if (ret)
+ goto err;
+
+ bpage->list.next = &(bpage + 1)->list;
+ bpage->list.prev = &(bpage - 1)->list;
+ bpage->id = i;
+
+ cpu_buffer->nr_pages = i + 1;
+ }
+
+ /* Close the ring */
+ bpage->list.next = &cpu_buffer->tail_page->list;
+ cpu_buffer->tail_page->list.prev = &bpage->list;
+
+ /* The last init'ed page points to the head page */
+ rb_set_flag(bpage, HYP_RB_PAGE_HEAD);
+
+ cpu_buffer->last_overrun = 0;
+
+ return 0;
+
+err:
+ rb_cpu_teardown(cpu_buffer);
+
+ return ret;
+}
+
+static int rb_setup_bpage_backing(struct hyp_trace_desc *desc)
+{
+ unsigned long start = kern_hyp_va(desc->backing.start);
+ size_t size = desc->backing.size;
+ int ret;
+
+ if (hyp_buffer_pages_backing.size)
+ return -EBUSY;
+
+ if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(size))
+ return -EINVAL;
+
+ ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn((void *)start), size >> PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ memset((void *)start, 0, size);
+
+ hyp_buffer_pages_backing.start = start;
+ hyp_buffer_pages_backing.size = size;
+
+ return 0;
+}
+
+static void rb_teardown_bpage_backing(void)
+{
+ unsigned long start = hyp_buffer_pages_backing.start;
+ size_t size = hyp_buffer_pages_backing.size;
+
+ if (!size)
+ return;
+
+ memset((void *)start, 0, size);
+
+ WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(start), size >> PAGE_SHIFT));
+
+ hyp_buffer_pages_backing.start = 0;
+ hyp_buffer_pages_backing.size = 0;
+}
+
+int __pkvm_swap_reader_tracing(int cpu)
+{
+ struct hyp_rb_per_cpu *cpu_buffer = per_cpu_ptr(&trace_rb, cpu);
+ int ret = 0;
+
+ hyp_spin_lock(&trace_rb_lock);
+
+ if (cpu >= hyp_nr_cpus) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ cpu_buffer = per_cpu_ptr(&trace_rb, cpu);
+ if (!rb_cpu_loaded(cpu_buffer))
+ ret = -ENODEV;
+ else
+ ret = rb_swap_reader_page(cpu_buffer);
+
+err:
+ hyp_spin_unlock(&trace_rb_lock);
+
+ return ret;
+}
+
+static void __pkvm_teardown_tracing_locked(void)
+{
+ int cpu;
+
+ hyp_assert_lock_held(&trace_rb_lock);
+
+ for (cpu = 0; cpu < hyp_nr_cpus; cpu++) {
+ struct hyp_rb_per_cpu *cpu_buffer = per_cpu_ptr(&trace_rb, cpu);
+
+ rb_cpu_teardown(cpu_buffer);
+ }
+
+ rb_teardown_bpage_backing();
+}
+
+void __pkvm_teardown_tracing(void)
+{
+ hyp_spin_lock(&trace_rb_lock);
+ __pkvm_teardown_tracing_locked();
+ hyp_spin_unlock(&trace_rb_lock);
+}
+
+int __pkvm_load_tracing(unsigned long desc_hva, size_t desc_size)
+{
+ struct hyp_trace_desc *desc = (struct hyp_trace_desc *)kern_hyp_va(desc_hva);
+ struct trace_page_desc *trace_pdesc = &desc->page_desc;
+ struct hyp_buffer_page *bpage_backing_start;
+ struct rb_page_desc *pdesc;
+ int ret, cpu;
+
+ if (!desc_size || !PAGE_ALIGNED(desc_hva) || !PAGE_ALIGNED(desc_size))
+ return -EINVAL;
+
+ ret = __pkvm_host_donate_hyp(hyp_virt_to_pfn((void *)desc),
+ desc_size >> PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ hyp_spin_lock(&trace_rb_lock);
+
+ ret = rb_setup_bpage_backing(desc);
+ if (ret)
+ goto err;
+
+ trace_clock_update(&desc->clock_data);
+
+ bpage_backing_start = (struct hyp_buffer_page *)hyp_buffer_pages_backing.start;
+
+ for_each_rb_page_desc(pdesc, cpu, trace_pdesc) {
+ struct hyp_rb_per_cpu *cpu_buffer;
+ int cpu;
+
+ ret = -EINVAL;
+ if (!rb_cpu_fits_desc(pdesc, desc_hva + desc_size))
+ break;
+
+ cpu = pdesc->cpu;
+ if (cpu >= hyp_nr_cpus)
+ break;
+
+ cpu_buffer = per_cpu_ptr(&trace_rb, cpu);
+
+ ret = rb_cpu_init(pdesc, bpage_backing_start, cpu_buffer);
+ if (ret)
+ break;
+
+ bpage_backing_start += pdesc->nr_page_va;
+ }
+
+err:
+ if (ret)
+ __pkvm_teardown_tracing_locked();
+
+ hyp_spin_unlock(&trace_rb_lock);
+
+ WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn((void *)desc),
+ desc_size >> PAGE_SHIFT));
+ return ret;
+}
+
+int __pkvm_enable_tracing(bool enable)
+{
+ int cpu, ret = enable ? -EINVAL : 0;
+
+ hyp_spin_lock(&trace_rb_lock);
+ for (cpu = 0; cpu < hyp_nr_cpus; cpu++) {
+ struct hyp_rb_per_cpu *cpu_buffer = per_cpu_ptr(&trace_rb, cpu);
+
+ if (enable) {
+ if (!rb_cpu_enable_writing(cpu_buffer))
+ ret = 0;
+ } else {
+ rb_cpu_disable_writing(cpu_buffer);
+ }
+
+ }
+ hyp_spin_unlock(&trace_rb_lock);
+
+ return ret;
+}
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 08/11] KVM: arm64: Add hyp tracing to tracefs
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (6 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 07/11] KVM: arm64: Add tracing support for the pKVM hyp Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 09/11] KVM: arm64: Add raw interface for hyp tracefs Vincent Donnefort
` (3 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
When running in KVM protected mode, the hypervisor is able to generate
events into tracefs-compatible ring-buffers. Plug those ring-buffers into
tracefs. The interface is found in hyp/ and mirrors the hierarchy of a host
instance, which eases support by existing user-space tools.
This patch doesn't provide any event support yet; that comes later in the
series.
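As an illustration, a minimal user-space consumer could look like the sketch
below. This is only an example: the tracefs mount point and the minimal error
handling are assumptions, not part of this patch.
  /* Hypothetical example: enable hyp tracing and drain hyp/trace_pipe. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
  	char buf[4096];
  	ssize_t len;
  	int fd;

  	/* Equivalent of: echo 1 > /sys/kernel/tracing/hyp/tracing_on */
  	fd = open("/sys/kernel/tracing/hyp/tracing_on", O_WRONLY);
  	if (fd < 0 || write(fd, "1", 1) != 1)
  		return 1;
  	close(fd);

  	/* Consuming read: events are removed from the ring-buffer. */
  	fd = open("/sys/kernel/tracing/hyp/trace_pipe", O_RDONLY);
  	if (fd < 0)
  		return 1;
  	while ((len = read(fd, buf, sizeof(buf))) > 0)
  		fwrite(buf, 1, len, stdout);
  	close(fd);

  	return 0;
  }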
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index a6497228c5a8..414a298a61f7 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,6 +25,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
+kvm-$(CONFIG_TRACING) += hyp_trace.o
+
always-y := hyp_constants.h hyp-constants.s
define rule_gen_hyp_constants
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a7ca776b51ec..2242862e3223 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -25,6 +25,7 @@
#define CREATE_TRACE_POINTS
#include "trace_arm.h"
+#include "hyp_trace.h"
#include <linux/uaccess.h>
#include <asm/ptrace.h>
@@ -2335,6 +2336,9 @@ static int __init init_subsystems(void)
kvm_register_perf_callbacks(NULL);
+ err = hyp_trace_init_tracefs();
+ if (err)
+ kvm_err("Failed to initialize Hyp tracing\n");
out:
if (err)
hyp_cpu_pm_exit();
diff --git a/arch/arm64/kvm/hyp/hyp-constants.c b/arch/arm64/kvm/hyp/hyp-constants.c
index b257a3b4bfc5..5c4a797a701f 100644
--- a/arch/arm64/kvm/hyp/hyp-constants.c
+++ b/arch/arm64/kvm/hyp/hyp-constants.c
@@ -3,11 +3,15 @@
#include <linux/kbuild.h>
#include <nvhe/memory.h>
#include <nvhe/pkvm.h>
+#include <nvhe/trace.h>
int main(void)
{
DEFINE(STRUCT_HYP_PAGE_SIZE, sizeof(struct hyp_page));
DEFINE(PKVM_HYP_VM_SIZE, sizeof(struct pkvm_hyp_vm));
DEFINE(PKVM_HYP_VCPU_SIZE, sizeof(struct pkvm_hyp_vcpu));
+#ifdef CONFIG_TRACING
+ DEFINE(STRUCT_HYP_BUFFER_PAGE_SIZE, sizeof(struct hyp_buffer_page));
+#endif
return 0;
}
diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c
new file mode 100644
index 000000000000..b4de7650cd8f
--- /dev/null
+++ b/arch/arm64/kvm/hyp_trace.c
@@ -0,0 +1,729 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/list.h>
+#include <linux/percpu-defs.h>
+#include <linux/tracefs.h>
+
+#include <asm/kvm_host.h>
+#include <asm/kvm_hyptrace.h>
+
+#include "hyp_constants.h"
+#include "hyp_trace.h"
+
+#define RB_POLL_MS 100
+
+#define TRACEFS_DIR "hyp"
+#define TRACEFS_MODE_WRITE 0640
+#define TRACEFS_MODE_READ 0440
+
+static struct hyp_trace_buffer {
+ struct hyp_trace_desc *desc;
+ struct ring_buffer_writer writer;
+ struct trace_buffer *trace_buffer;
+ size_t desc_size;
+ bool tracing_on;
+ int nr_readers;
+ struct mutex lock;
+} hyp_trace_buffer = {
+ .lock = __MUTEX_INITIALIZER(hyp_trace_buffer.lock),
+};
+
+static size_t hyp_trace_buffer_size = 7 << 10;
+
+/* Number of pages the ring-buffer requires to accommodate size (+1 for the reader page) */
+#define NR_PAGES(size) \
+ ((PAGE_ALIGN(size) >> PAGE_SHIFT) + 1)
+
+static inline bool hyp_trace_buffer_loaded(struct hyp_trace_buffer *hyp_buffer)
+{
+ return !!hyp_buffer->trace_buffer;
+}
+
+static int
+bpage_backing_alloc(struct hyp_buffer_pages_backing *bpage_backing, size_t size)
+{
+ size_t backing_size;
+ void *start;
+
+ backing_size = PAGE_ALIGN(STRUCT_HYP_BUFFER_PAGE_SIZE * NR_PAGES(size));
+
+ start = alloc_pages_exact(backing_size, GFP_KERNEL_ACCOUNT);
+ if (!start)
+ return -ENOMEM;
+
+ bpage_backing->start = (unsigned long)start;
+ bpage_backing->size = backing_size;
+
+ return 0;
+}
+
+static void
+bpage_backing_free(struct hyp_buffer_pages_backing *bpage_backing)
+{
+ free_pages_exact((void *)bpage_backing->start, bpage_backing->size);
+}
+
+/*
+ * Configure the hyp tracing clock. So far, only one is supported: "boot". This
+ * clock doesn't stop during suspend, making it a good candidate. The downside
+ * is that if this clock is corrected by NTP while tracing, the hyp clock will
+ * slightly drift compared to the host version.
+ */
+static void hyp_clock_setup(struct hyp_trace_desc *desc)
+{
+ struct kvm_nvhe_clock_data *clock_data = &desc->clock_data;
+ struct system_time_snapshot snap;
+
+ ktime_get_snapshot(&snap);
+
+ clock_data->epoch_cyc = snap.cycles;
+ clock_data->epoch_ns = snap.boot;
+ clock_data->mult = snap.mono_mult;
+ clock_data->shift = snap.mono_shift;
+}
+
+static int __get_reader_page(int cpu)
+{
+ return kvm_call_hyp_nvhe(__pkvm_swap_reader_tracing, cpu);
+}
+
+static void hyp_trace_free_pages(struct hyp_trace_desc *desc)
+{
+ struct rb_page_desc *rb_desc;
+ int cpu, id;
+
+ for_each_rb_page_desc(rb_desc, cpu, &desc->page_desc) {
+ free_page(rb_desc->meta_va);
+ for (id = 0; id < rb_desc->nr_page_va; id++)
+ free_page(rb_desc->page_va[id]);
+ }
+}
+
+static int hyp_trace_alloc_pages(struct hyp_trace_desc *desc, size_t size)
+{
+ int err = 0, cpu, id, nr_pages = NR_PAGES(size);
+ struct trace_page_desc *trace_desc;
+ struct rb_page_desc *rb_desc;
+
+ trace_desc = &desc->page_desc;
+ trace_desc->nr_cpus = 0;
+
+ rb_desc = (struct rb_page_desc *)&trace_desc->__data[0];
+
+ for_each_possible_cpu(cpu) {
+ rb_desc->cpu = cpu;
+ rb_desc->nr_page_va = 0;
+ rb_desc->meta_va = (unsigned long)page_to_virt(alloc_page(GFP_KERNEL));
+ if (!rb_desc->meta_va) {
+ err = -ENOMEM;
+ break;
+ }
+ for (id = 0; id < nr_pages; id++) {
+ rb_desc->page_va[id] = (unsigned long)page_to_virt(alloc_page(GFP_KERNEL));
+ if (!rb_desc->page_va[id]) {
+ err = -ENOMEM;
+ break;
+ }
+ rb_desc->nr_page_va++;
+ }
+ trace_desc->nr_cpus++;
+ rb_desc = __next_rb_page_desc(rb_desc);
+ }
+
+ if (err) {
+ hyp_trace_free_pages(desc);
+ return err;
+ }
+
+ return 0;
+}
+
+static int __load_page(unsigned long va)
+{
+ return kvm_call_hyp_nvhe(__pkvm_host_share_hyp, virt_to_pfn((void *)va), 1);
+}
+
+static void __teardown_page(unsigned long va)
+{
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp, virt_to_pfn((void *)va), 1));
+}
+
+static void hyp_trace_teardown_pages(struct hyp_trace_desc *desc,
+ int last_cpu)
+{
+ struct rb_page_desc *rb_desc;
+ int cpu, id;
+
+ for_each_rb_page_desc(rb_desc, cpu, &desc->page_desc) {
+ if (cpu > last_cpu)
+ break;
+ __teardown_page(rb_desc->meta_va);
+ for (id = 0; id < rb_desc->nr_page_va; id++)
+ __teardown_page(rb_desc->page_va[id]);
+ }
+}
+
+static int hyp_trace_load_pages(struct hyp_trace_desc *desc)
+{
+ int last_loaded_cpu = 0, cpu, id, err = -EINVAL;
+ struct rb_page_desc *rb_desc;
+
+ for_each_rb_page_desc(rb_desc, cpu, &desc->page_desc) {
+ err = __load_page(rb_desc->meta_va);
+ if (err)
+ break;
+
+ for (id = 0; id < rb_desc->nr_page_va; id++) {
+ err = __load_page(rb_desc->page_va[id]);
+ if (err)
+ break;
+ }
+
+ if (!err)
+ continue;
+
+ for (id--; id >= 0; id--)
+ __teardown_page(rb_desc->page_va[id]);
+
+ last_loaded_cpu = cpu - 1;
+
+ break;
+ }
+
+ if (!err)
+ return 0;
+
+ hyp_trace_teardown_pages(desc, last_loaded_cpu);
+
+ return err;
+}
+
+static int hyp_trace_buffer_load(struct hyp_trace_buffer *hyp_buffer, size_t size)
+{
+ int ret, nr_pages = NR_PAGES(size);
+ struct rb_page_desc *rbdesc;
+ struct hyp_trace_desc *desc;
+ size_t desc_size;
+
+ if (hyp_trace_buffer_loaded(hyp_buffer))
+ return 0;
+
+ desc_size = size_add(offsetof(struct hyp_trace_desc, page_desc),
+ offsetof(struct trace_page_desc, __data));
+ desc_size = size_add(desc_size,
+ size_mul(num_possible_cpus(),
+ struct_size(rbdesc, page_va, nr_pages)));
+ if (desc_size == SIZE_MAX)
+ return -E2BIG;
+
+ /*
+ * The hypervisor will unmap the descriptor from the host while reading
+ * it. Allocating with page granularity ensures no other useful data
+ * will be unmapped.
+ */
+ desc_size = PAGE_ALIGN(desc_size);
+
+ desc = (struct hyp_trace_desc *)alloc_pages_exact(desc_size, GFP_KERNEL);
+ if (!desc)
+ return -ENOMEM;
+
+ ret = hyp_trace_alloc_pages(desc, size);
+ if (ret)
+ goto err_free_desc;
+
+ ret = bpage_backing_alloc(&desc->backing, size);
+ if (ret)
+ goto err_free_pages;
+
+ ret = hyp_trace_load_pages(desc);
+ if (ret)
+ goto err_free_backing;
+
+ hyp_clock_setup(desc);
+
+ ret = kvm_call_hyp_nvhe(__pkvm_load_tracing, (unsigned long)desc, desc_size);
+ if (ret)
+ goto err_teardown_pages;
+
+ hyp_buffer->writer.pdesc = &desc->page_desc;
+ hyp_buffer->writer.get_reader_page = __get_reader_page;
+ hyp_buffer->trace_buffer = ring_buffer_reader(&hyp_buffer->writer);
+ if (!hyp_buffer->trace_buffer) {
+ ret = -ENOMEM;
+ goto err_teardown_tracing;
+ }
+
+ hyp_buffer->desc = desc;
+ hyp_buffer->desc_size = desc_size;
+
+ return 0;
+
+err_teardown_tracing:
+ kvm_call_hyp_nvhe(__pkvm_teardown_tracing);
+err_teardown_pages:
+ hyp_trace_teardown_pages(desc, INT_MAX);
+err_free_backing:
+ bpage_backing_free(&desc->backing);
+err_free_pages:
+ hyp_trace_free_pages(desc);
+err_free_desc:
+ free_pages_exact(desc, desc_size);
+
+ return ret;
+}
+
+static void hyp_trace_buffer_teardown(struct hyp_trace_buffer *hyp_buffer)
+{
+ struct hyp_trace_desc *desc = hyp_buffer->desc;
+ size_t desc_size = hyp_buffer->desc_size;
+
+ if (kvm_call_hyp_nvhe(__pkvm_teardown_tracing))
+ return;
+
+ ring_buffer_free(hyp_buffer->trace_buffer);
+ hyp_trace_teardown_pages(desc, INT_MAX);
+ bpage_backing_free(&desc->backing);
+ hyp_trace_free_pages(desc);
+ free_pages_exact(desc, desc_size);
+ hyp_buffer->trace_buffer = NULL;
+}
+
+static int hyp_tracing_teardown(void)
+{
+ struct hyp_trace_buffer *hyp_buffer = &hyp_trace_buffer;
+ int ret = 0;
+
+ mutex_lock(&hyp_buffer->lock);
+ if (!hyp_trace_buffer_loaded(hyp_buffer))
+ goto out;
+
+ if (hyp_buffer->tracing_on || hyp_buffer->nr_readers) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ hyp_trace_buffer_teardown(hyp_buffer);
+out:
+ mutex_unlock(&hyp_buffer->lock);
+
+ return ret;
+}
+
+static int hyp_tracing_start(void)
+{
+ struct hyp_trace_buffer *hyp_buffer = &hyp_trace_buffer;
+ int ret;
+
+ mutex_lock(&hyp_buffer->lock);
+
+ ret = hyp_trace_buffer_load(hyp_buffer, hyp_trace_buffer_size);
+ if (ret)
+ goto out;
+
+ ret = kvm_call_hyp_nvhe(__pkvm_enable_tracing, true);
+ if (!ret)
+ hyp_buffer->tracing_on = true;
+
+out:
+ mutex_unlock(&hyp_buffer->lock);
+
+ return ret;
+}
+
+static void hyp_tracing_stop(void)
+{
+ struct hyp_trace_buffer *hyp_buffer = &hyp_trace_buffer;
+ int ret;
+
+ mutex_lock(&hyp_buffer->lock);
+ if (!hyp_trace_buffer_loaded(hyp_buffer))
+ goto end;
+
+ ret = kvm_call_hyp_nvhe(__pkvm_enable_tracing, false);
+ if (!ret) {
+ /*
+ * There is no way to flush the remaining data on reader
+ * release. So instead, do it when tracing stops.
+ */
+ ring_buffer_poll_writer(hyp_buffer->trace_buffer,
+ RING_BUFFER_ALL_CPUS);
+ hyp_buffer->tracing_on = false;
+ }
+
+end:
+ mutex_unlock(&hyp_buffer->lock);
+}
+
+static ssize_t
+hyp_tracing_on(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+ int err = 0;
+ char c;
+
+ if (!cnt || cnt > 2)
+ return -EINVAL;
+
+ if (get_user(c, ubuf))
+ return -EFAULT;
+
+ switch (c) {
+ case '1':
+ err = hyp_tracing_start();
+ break;
+ case '0':
+ hyp_tracing_stop();
+ break;
+ default:
+ err = -EINVAL;
+ }
+
+ return err ? err : cnt;
+}
+
+static ssize_t hyp_tracing_on_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[3];
+ int r;
+
+ mutex_lock(&hyp_trace_buffer.lock);
+ r = sprintf(buf, "%d\n", hyp_trace_buffer.tracing_on);
+ mutex_unlock(&hyp_trace_buffer.lock);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static const struct file_operations hyp_tracing_on_fops = {
+ .write = hyp_tracing_on,
+ .read = hyp_tracing_on_read,
+};
+
+static ssize_t hyp_buffer_size(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ unsigned long val;
+ int ret;
+
+ ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+ if (ret)
+ return ret;
+
+ if (!val)
+ return -EINVAL;
+
+ mutex_lock(&hyp_trace_buffer.lock);
+ hyp_trace_buffer_size = val << 10; /* KB to B */
+ mutex_unlock(&hyp_trace_buffer.lock);
+
+ return cnt;
+}
+
+static ssize_t hyp_buffer_size_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[64];
+ int r;
+
+ mutex_lock(&hyp_trace_buffer.lock);
+ r = sprintf(buf, "%lu\n", hyp_trace_buffer_size >> 10);
+ mutex_unlock(&hyp_trace_buffer.lock);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static const struct file_operations hyp_buffer_size_fops = {
+ .write = hyp_buffer_size,
+ .read = hyp_buffer_size_read,
+};
+
+static void ht_print_trace_time(struct ht_iterator *iter)
+{
+ unsigned long usecs_rem;
+ u64 ts_ns = iter->ts;
+
+ do_div(ts_ns, 1000);
+ usecs_rem = do_div(ts_ns, USEC_PER_SEC);
+
+ trace_seq_printf(&iter->seq, "%5lu.%06lu: ",
+ (unsigned long)ts_ns, usecs_rem);
+}
+
+static void ht_print_trace_cpu(struct ht_iterator *iter)
+{
+ trace_seq_printf(&iter->seq, "[%03d]\t", iter->ent_cpu);
+}
+
+static int ht_print_trace_fmt(struct ht_iterator *iter)
+{
+ if (iter->lost_events)
+ trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",
+ iter->ent_cpu, iter->lost_events);
+
+ ht_print_trace_cpu(iter);
+ ht_print_trace_time(iter);
+
+ return trace_seq_has_overflowed(&iter->seq) ? -EOVERFLOW : 0;
+};
+
+static struct ring_buffer_event *__ht_next_pipe_event(struct ht_iterator *iter)
+{
+ struct ring_buffer_event *evt = NULL;
+ int cpu = iter->cpu;
+
+ if (cpu != RING_BUFFER_ALL_CPUS) {
+ if (ring_buffer_empty_cpu(iter->trace_buffer, cpu))
+ return NULL;
+
+ iter->ent_cpu = cpu;
+
+ return ring_buffer_peek(iter->trace_buffer, cpu, &iter->ts,
+ &iter->lost_events);
+ }
+
+ iter->ts = LLONG_MAX;
+ for_each_possible_cpu(cpu) {
+ struct ring_buffer_event *_evt;
+ unsigned long lost_events;
+ u64 ts;
+
+ if (ring_buffer_empty_cpu(iter->trace_buffer, cpu))
+ continue;
+
+ _evt = ring_buffer_peek(iter->trace_buffer, cpu, &ts,
+ &lost_events);
+ if (!_evt)
+ continue;
+
+ if (ts >= iter->ts)
+ continue;
+
+ iter->ts = ts;
+ iter->ent_cpu = cpu;
+ iter->lost_events = lost_events;
+ evt = _evt;
+ }
+
+ return evt;
+}
+
+static void *ht_next_pipe_event(struct ht_iterator *iter)
+{
+ struct ring_buffer_event *event;
+
+ event = __ht_next_pipe_event(iter);
+ if (!event)
+ return NULL;
+
+ iter->ent = (struct hyp_entry_hdr *)&event->array[1];
+ iter->ent_size = event->array[0];
+
+ return iter;
+}
+
+static ssize_t
+hyp_trace_pipe_read(struct file *file, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct ht_iterator *iter = (struct ht_iterator *)file->private_data;
+ int ret;
+
+copy_to_user:
+ ret = trace_seq_to_user(&iter->seq, ubuf, cnt);
+ if (ret != -EBUSY)
+ return ret;
+
+ trace_seq_init(&iter->seq);
+
+ ret = ring_buffer_wait(iter->trace_buffer, iter->cpu, 0, NULL, NULL);
+ if (ret < 0)
+ return ret;
+
+ while (ht_next_pipe_event(iter)) {
+ int prev_len = iter->seq.seq.len;
+
+ if (ht_print_trace_fmt(iter)) {
+ iter->seq.seq.len = prev_len;
+ break;
+ }
+
+ ring_buffer_consume(iter->trace_buffer, iter->ent_cpu, NULL,
+ NULL);
+ }
+
+ goto copy_to_user;
+}
+
+static void __poll_writer(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct ht_iterator *iter;
+
+ iter = container_of(dwork, struct ht_iterator, poll_work);
+
+ ring_buffer_poll_writer(iter->trace_buffer, iter->cpu);
+
+ schedule_delayed_work((struct delayed_work *)work,
+ msecs_to_jiffies(RB_POLL_MS));
+}
+
+static int hyp_trace_pipe_open(struct inode *inode, struct file *file)
+{
+ struct hyp_trace_buffer *hyp_buffer = &hyp_trace_buffer;
+ int cpu = (s64)inode->i_private;
+ struct ht_iterator *iter = NULL;
+ bool need_loading = false;
+ int ret;
+
+ mutex_lock(&hyp_buffer->lock);
+
+ if (hyp_buffer->nr_readers == INT_MAX) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ need_loading = !hyp_trace_buffer_loaded(hyp_buffer);
+ if (need_loading) {
+ ret = hyp_trace_buffer_load(hyp_buffer, hyp_trace_buffer_size);
+ if (ret) {
+ need_loading = false;
+ goto unlock;
+ }
+ }
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ iter->trace_buffer = hyp_buffer->trace_buffer;
+ iter->cpu = cpu;
+ trace_seq_init(&iter->seq);
+ file->private_data = iter;
+
+ ret = ring_buffer_poll_writer(hyp_buffer->trace_buffer, cpu);
+ if (ret)
+ goto unlock;
+
+ INIT_DELAYED_WORK(&iter->poll_work, __poll_writer);
+ schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(RB_POLL_MS));
+
+ hyp_buffer->nr_readers++;
+
+unlock:
+ if (ret) {
+ if (need_loading)
+ hyp_trace_buffer_teardown(hyp_buffer);
+ kfree(iter);
+ }
+
+ mutex_unlock(&hyp_buffer->lock);
+
+ return ret;
+}
+
+static int hyp_trace_pipe_release(struct inode *inode, struct file *file)
+{
+ struct hyp_trace_buffer *hyp_buffer = &hyp_trace_buffer;
+ struct ht_iterator *iter = file->private_data;
+
+ cancel_delayed_work_sync(&iter->poll_work);
+
+ mutex_lock(&hyp_buffer->lock);
+ WARN_ON(--hyp_buffer->nr_readers < 0);
+ mutex_unlock(&hyp_buffer->lock);
+
+ kfree(iter);
+
+ return 0;
+}
+
+static const struct file_operations hyp_trace_pipe_fops = {
+ .open = hyp_trace_pipe_open,
+ .read = hyp_trace_pipe_read,
+ .release = hyp_trace_pipe_release,
+ .llseek = no_llseek,
+};
+
+static int hyp_trace_open(struct inode *inode, struct file *file)
+{
+ return file->f_mode & FMODE_WRITE ? hyp_tracing_teardown() : 0;
+}
+
+static ssize_t hyp_trace_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[] = "** Reading trace not yet supported **\n";
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, strlen(buf));
+}
+
+static ssize_t hyp_trace_write(struct file *filp, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ return count;
+}
+
+static const struct file_operations hyp_trace_fops = {
+ .open = hyp_trace_open,
+ .read = hyp_trace_read,
+ .write = hyp_trace_write,
+ .release = NULL,
+};
+
+int hyp_trace_init_tracefs(void)
+{
+ struct dentry *root, *per_cpu_root;
+ char per_cpu_name[16];
+ long cpu;
+
+ if (!is_protected_kvm_enabled())
+ return 0;
+
+ root = tracefs_create_dir(TRACEFS_DIR, NULL);
+ if (!root) {
+ pr_err("Failed to create tracefs "TRACEFS_DIR"/\n");
+ return -ENODEV;
+ }
+
+ tracefs_create_file("tracing_on", TRACEFS_MODE_WRITE, root, NULL,
+ &hyp_tracing_on_fops);
+
+ tracefs_create_file("buffer_size_kb", TRACEFS_MODE_WRITE, root, NULL,
+ &hyp_buffer_size_fops);
+
+ tracefs_create_file("trace", TRACEFS_MODE_WRITE, root,
+ (void *)RING_BUFFER_ALL_CPUS, &hyp_trace_fops);
+
+ tracefs_create_file("trace_pipe", TRACEFS_MODE_WRITE, root,
+ (void *)RING_BUFFER_ALL_CPUS, &hyp_trace_pipe_fops);
+
+ tracefs_create_file("trace", TRACEFS_MODE_WRITE, root, NULL,
+ &hyp_trace_fops);
+
+ per_cpu_root = tracefs_create_dir("per_cpu", root);
+ if (!per_cpu_root) {
+ pr_err("Failed to create tracefs folder "TRACEFS_DIR"/per_cpu/\n");
+ return -ENODEV;
+ }
+
+ for_each_possible_cpu(cpu) {
+ struct dentry *per_cpu_dir;
+
+ snprintf(per_cpu_name, sizeof(per_cpu_name), "cpu%ld", cpu);
+ per_cpu_dir = tracefs_create_dir(per_cpu_name, per_cpu_root);
+ if (!per_cpu_dir) {
+ pr_warn("Failed to create tracefs "TRACEFS_DIR"/per_cpu/cpu%ld\n",
+ cpu);
+ continue;
+ }
+ tracefs_create_file("trace_pipe", TRACEFS_MODE_READ, per_cpu_dir,
+ (void *)cpu, &hyp_trace_pipe_fops);
+ }
+
+ return 0;
+}
diff --git a/arch/arm64/kvm/hyp_trace.h b/arch/arm64/kvm/hyp_trace.h
new file mode 100644
index 000000000000..14fc06c625a6
--- /dev/null
+++ b/arch/arm64/kvm/hyp_trace.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARM64_KVM_HYP_TRACE_H__
+#define __ARM64_KVM_HYP_TRACE_H__
+
+#include <linux/trace_seq.h>
+#include <linux/workqueue.h>
+
+struct ht_iterator {
+ struct trace_buffer *trace_buffer;
+ int cpu;
+ struct hyp_entry_hdr *ent;
+ unsigned long lost_events;
+ int ent_cpu;
+ size_t ent_size;
+ u64 ts;
+ void *spare;
+ size_t copy_leftover;
+ struct trace_seq seq;
+ struct delayed_work poll_work;
+};
+
+#ifdef CONFIG_TRACING
+int hyp_trace_init_tracefs(void);
+#else
+static inline int hyp_trace_init_tracefs(void) { return 0; }
+#endif
+#endif
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 09/11] KVM: arm64: Add raw interface for hyp tracefs
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (7 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 08/11] KVM: arm64: Add hyp tracing to tracefs Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 10/11] KVM: arm64: Add support for hyp events Vincent Donnefort
` (2 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
The raw interface enables userspace tools such as trace-cmd to directly
read the ring-buffer without any decoding by the kernel.
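As an illustration, a raw consumer could look like the sketch below. This is
only an example: the cpu0 path and the 4KiB sub-buffer size are assumptions
(trace-cmd performs the equivalent reads internally).
  /* Hypothetical example: dump raw sub-buffer pages for CPU0 to stdout. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
  	char page[4096];	/* assumed sub-buffer (PAGE_SIZE) size */
  	ssize_t len;
  	int fd;

  	fd = open("/sys/kernel/tracing/hyp/per_cpu/cpu0/trace_pipe_raw",
  		  O_RDONLY);
  	if (fd < 0)
  		return 1;

  	/* Binary ring-buffer pages, no decoding done by the kernel. */
  	while ((len = read(fd, page, sizeof(page))) > 0)
  		fwrite(page, 1, len, stdout);

  	close(fd);
  	return 0;
  }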
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c
index b4de7650cd8f..36c120aeabc0 100644
--- a/arch/arm64/kvm/hyp_trace.c
+++ b/arch/arm64/kvm/hyp_trace.c
@@ -649,6 +649,86 @@ static const struct file_operations hyp_trace_pipe_fops = {
.llseek = no_llseek,
};
+static ssize_t
+hyp_trace_raw_read(struct file *file, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct ht_iterator *iter = (struct ht_iterator *)file->private_data;
+ size_t size;
+ int ret;
+
+ if (iter->copy_leftover)
+ goto read;
+
+again:
+ ret = ring_buffer_read_page(iter->trace_buffer,
+ (struct buffer_data_read_page *)iter->spare,
+ cnt, iter->cpu, 0);
+ if (ret < 0) {
+ if (!ring_buffer_empty_cpu(iter->trace_buffer, iter->cpu))
+ return 0;
+
+ ret = ring_buffer_wait(iter->trace_buffer, iter->cpu, 0, NULL,
+ NULL);
+ if (ret < 0)
+ return ret;
+
+ goto again;
+ }
+
+ iter->copy_leftover = 0;
+
+read:
+ size = PAGE_SIZE - iter->copy_leftover;
+ if (size > cnt)
+ size = cnt;
+
+ ret = copy_to_user(ubuf, iter->spare + PAGE_SIZE - size, size);
+ if (ret == size)
+ return -EFAULT;
+
+ size -= ret;
+ *ppos += size;
+ iter->copy_leftover = ret;
+
+ return size;
+}
+
+static int hyp_trace_raw_open(struct inode *inode, struct file *file)
+{
+ int ret = hyp_trace_pipe_open(inode, file);
+ struct ht_iterator *iter;
+
+ if (ret)
+ return ret;
+
+ iter = file->private_data;
+ iter->spare = ring_buffer_alloc_read_page(iter->trace_buffer, iter->cpu);
+ if (IS_ERR(iter->spare)) {
+ ret = PTR_ERR(iter->spare);
+ iter->spare = NULL;
+ return ret;
+ }
+
+ return 0;
+}
+
+static int hyp_trace_raw_release(struct inode *inode, struct file *file)
+{
+ struct ht_iterator *iter = file->private_data;
+
+ ring_buffer_free_read_page(iter->trace_buffer, iter->cpu, iter->spare);
+
+ return hyp_trace_pipe_release(inode, file);
+}
+
+static const struct file_operations hyp_trace_raw_fops = {
+ .open = hyp_trace_raw_open,
+ .read = hyp_trace_raw_read,
+ .release = hyp_trace_raw_release,
+ .llseek = no_llseek,
+};
+
static int hyp_trace_open(struct inode *inode, struct file *file)
{
return file->f_mode & FMODE_WRITE ? hyp_tracing_teardown() : 0;
@@ -723,6 +803,8 @@ int hyp_trace_init_tracefs(void)
}
tracefs_create_file("trace_pipe", TRACEFS_MODE_READ, per_cpu_dir,
(void *)cpu, &hyp_trace_pipe_fops);
+ tracefs_create_file("trace_pipe_raw", TRACEFS_MODE_READ, per_cpu_dir,
+ (void *)cpu, &hyp_trace_raw_fops);
}
return 0;
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 10/11] KVM: arm64: Add support for hyp events
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (8 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 09/11] KVM: arm64: Add raw interface for hyp tracefs Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-05 17:32 ` [RFC PATCH 11/11] KVM: arm64: Add kselftest for tracefs hyp tracefs Vincent Donnefort
2024-08-06 20:11 ` [RFC PATCH 00/11] Tracefs support for pKVM Steven Rostedt
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
Following the introduction of hyp tracing for pKVM, add the ability to
describe and emit events into the hypervisor ring-buffers.
Hypervisor events are declared in kvm_hypevents.h and emitted by calling
trace_<event_name>(), in a similar fashion to regular kernel trace events.
hyp_enter and hyp_exit events are provided as an example.
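For illustration only, a hypothetical event carrying a payload could be
declared in kvm_hypevents.h as below (hyp_vcpu_run is not part of this
series, it is just a sketch of the HE_* helpers):
  HYP_EVENT(hyp_vcpu_run,
  	HE_PROTO(unsigned int vcpu_id, u64 pc),
  	HE_STRUCT(
  		he_field(unsigned int, vcpu_id)
  		he_field(u64, pc)
  	),
  	HE_ASSIGN(
  		__entry->vcpu_id = vcpu_id;
  		__entry->pc = pc;
  	),
  	HE_PRINTK("vcpu=%u pc=0x%llx", __entry->vcpu_id, __entry->pc)
  );
The hypervisor would then emit it with trace_hyp_vcpu_run(vcpu_id, pc), which
only writes into the ring-buffer when the event has been enabled from the
host.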
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index d549d7d491c3..1a576ef28b7f 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -83,6 +83,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_tracing,
__KVM_HOST_SMCCC_FUNC___pkvm_enable_tracing,
__KVM_HOST_SMCCC_FUNC___pkvm_swap_reader_tracing,
+ __KVM_HOST_SMCCC_FUNC___pkvm_enable_event,
};
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
diff --git a/arch/arm64/include/asm/kvm_define_hypevents.h b/arch/arm64/include/asm/kvm_define_hypevents.h
new file mode 100644
index 000000000000..efa2c2cb3ef2
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_define_hypevents.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/trace_events.h>
+
+#include <asm/kvm_hyptrace.h>
+#include <asm/kvm_hypevents_defs.h>
+
+#ifndef HYP_EVENT_FILE
+# undef __ARM64_KVM_HYPEVENTS_H_
+# define __HYP_EVENT_FILE <asm/kvm_hypevents.h>
+#else
+# define __HYP_EVENT_FILE __stringify(HYP_EVENT_FILE)
+#endif
+
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ HYP_EVENT_FORMAT(__name, __struct); \
+ static void hyp_event_trace_##__name(struct ht_iterator *iter) \
+ { \
+ struct trace_hyp_format_##__name __maybe_unused *__entry = \
+ (struct trace_hyp_format_##__name *)iter->ent; \
+ trace_seq_puts(&iter->seq, #__name); \
+ trace_seq_putc(&iter->seq, ' '); \
+ trace_seq_printf(&iter->seq, __printk); \
+ trace_seq_putc(&iter->seq, '\n'); \
+ }
+#define HYP_EVENT_MULTI_READ
+#include __HYP_EVENT_FILE
+
+#undef he_field
+#define he_field(_type, _item) \
+ { \
+ .type = #_type, .name = #_item, \
+ .size = sizeof(_type), .align = __alignof__(_type), \
+ .is_signed = is_signed_type(_type), \
+ },
+#undef HYP_EVENT
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ static struct trace_event_fields hyp_event_fields_##__name[] = { \
+ __struct \
+ {} \
+ };
+#include __HYP_EVENT_FILE
+
+#undef HYP_EVENT
+#undef HE_PRINTK
+#define __entry REC
+#define HE_PRINTK(fmt, args...) "\"" fmt "\", " __stringify(args)
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ static char hyp_event_print_fmt_##__name[] = __printk; \
+ static bool hyp_event_enabled_##__name; \
+ struct hyp_event __section("_hyp_events") hyp_event_##__name = {\
+ .name = #__name, \
+ .enabled = &hyp_event_enabled_##__name, \
+ .fields = hyp_event_fields_##__name, \
+ .print_fmt = hyp_event_print_fmt_##__name, \
+ .trace_func = hyp_event_trace_##__name, \
+ }
+#include __HYP_EVENT_FILE
+
+#undef HYP_EVENT_MULTI_READ
diff --git a/arch/arm64/include/asm/kvm_hypevents.h b/arch/arm64/include/asm/kvm_hypevents.h
new file mode 100644
index 000000000000..0b98a87a1250
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_hypevents.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#if !defined(__ARM64_KVM_HYPEVENTS_H_) || defined(HYP_EVENT_MULTI_READ)
+#define __ARM64_KVM_HYPEVENTS_H_
+
+#ifdef __KVM_NVHE_HYPERVISOR__
+#include <nvhe/trace.h>
+#endif
+
+/*
+ * Hypervisor events definitions.
+ */
+
+HYP_EVENT(hyp_enter,
+ HE_PROTO(void),
+ HE_STRUCT(
+ ),
+ HE_ASSIGN(
+ ),
+ HE_PRINTK(" ")
+);
+
+HYP_EVENT(hyp_exit,
+ HE_PROTO(void),
+ HE_STRUCT(
+ ),
+ HE_ASSIGN(
+ ),
+ HE_PRINTK(" ")
+);
+#endif
diff --git a/arch/arm64/include/asm/kvm_hypevents_defs.h b/arch/arm64/include/asm/kvm_hypevents_defs.h
new file mode 100644
index 000000000000..473bf4363d82
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_hypevents_defs.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARM64_KVM_HYPEVENTS_DEFS_H
+#define __ARM64_KVM_HYPEVENTS_DEFS_H
+
+struct hyp_event_id {
+ unsigned short id;
+ void *data;
+};
+
+#define HYP_EVENT_NAME_MAX 32
+
+struct hyp_event {
+ char name[HYP_EVENT_NAME_MAX];
+ bool *enabled;
+ char *print_fmt;
+ struct trace_event_fields *fields;
+ void (*trace_func)(struct ht_iterator *iter);
+ int id;
+};
+
+struct hyp_entry_hdr {
+ unsigned short id;
+};
+
+/*
+ * Hyp events definitions common to the hyp and the host
+ */
+#define HYP_EVENT_FORMAT(__name, __struct) \
+ struct __packed trace_hyp_format_##__name { \
+ struct hyp_entry_hdr hdr; \
+ __struct \
+ }
+
+#define HE_PROTO(args...) args
+#define HE_STRUCT(args...) args
+#define HE_ASSIGN(args...) args
+#define HE_PRINTK(args...) args
+
+#define he_field(type, item) type item;
+#endif
diff --git a/arch/arm64/include/asm/kvm_hyptrace.h b/arch/arm64/include/asm/kvm_hyptrace.h
index b371776da0fe..b4fca6986546 100644
--- a/arch/arm64/include/asm/kvm_hyptrace.h
+++ b/arch/arm64/include/asm/kvm_hyptrace.h
@@ -4,6 +4,22 @@
#include <asm/kvm_hyp.h>
#include <linux/ring_buffer.h>
+#include <linux/trace_seq.h>
+#include <linux/workqueue.h>
+
+struct ht_iterator {
+ struct trace_buffer *trace_buffer;
+ int cpu;
+ struct hyp_entry_hdr *ent;
+ unsigned long lost_events;
+ int ent_cpu;
+ size_t ent_size;
+ u64 ts;
+ void *spare;
+ size_t copy_leftover;
+ struct trace_seq seq;
+ struct delayed_work poll_work;
+};
/*
* Host donations to the hypervisor to store the struct hyp_buffer_page.
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 8f5422ed1b75..e60754cdbf33 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -134,6 +134,10 @@ KVM_NVHE_ALIAS(__hyp_bss_start);
KVM_NVHE_ALIAS(__hyp_bss_end);
KVM_NVHE_ALIAS(__hyp_rodata_start);
KVM_NVHE_ALIAS(__hyp_rodata_end);
+#ifdef CONFIG_TRACING
+KVM_NVHE_ALIAS(__hyp_event_ids_start);
+KVM_NVHE_ALIAS(__hyp_event_ids_end);
+#endif
/* pKVM static key */
KVM_NVHE_ALIAS(kvm_protected_mode_initialized);
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 55a8e310ea12..96986c1f852c 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -13,12 +13,23 @@
*(__kvm_ex_table) \
__stop___kvm_ex_table = .;
+#ifdef CONFIG_TRACING
+#define HYPERVISOR_EVENT_IDS \
+ . = ALIGN(PAGE_SIZE); \
+ __hyp_event_ids_start = .; \
+ *(HYP_SECTION_NAME(.event_ids)) \
+ __hyp_event_ids_end = .;
+#else
+#define HYPERVISOR_EVENT_IDS
+#endif
+
#define HYPERVISOR_DATA_SECTIONS \
HYP_SECTION_NAME(.rodata) : { \
. = ALIGN(PAGE_SIZE); \
__hyp_rodata_start = .; \
*(HYP_SECTION_NAME(.data..ro_after_init)) \
*(HYP_SECTION_NAME(.rodata)) \
+ HYPERVISOR_EVENT_IDS \
. = ALIGN(PAGE_SIZE); \
__hyp_rodata_end = .; \
}
@@ -200,6 +211,13 @@ SECTIONS
ASSERT(SIZEOF(.got.plt) == 0 || SIZEOF(.got.plt) == 0x18,
"Unexpected GOT/PLT entries detected!")
+#ifdef CONFIG_TRACING
+ .rodata.hyp_events : {
+ __hyp_events_start = .;
+ *(_hyp_events)
+ __hyp_events_end = .;
+ }
+#endif
/* code sections that are never executed via the kernel mapping */
.rodata.text : {
TRAMP_TEXT
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 414a298a61f7..5b34c6a261c0 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,7 +25,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
-kvm-$(CONFIG_TRACING) += hyp_trace.o
+kvm-$(CONFIG_TRACING) += hyp_events.o hyp_trace.o
always-y := hyp_constants.h hyp-constants.s
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2242862e3223..fdee8f669def 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2661,6 +2661,8 @@ static int __init init_hyp_mode(void)
kvm_hyp_init_symbols();
+ hyp_trace_init_events();
+
if (is_protected_kvm_enabled()) {
if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) &&
cpus_have_final_cap(ARM64_HAS_ADDRESS_AUTH))
diff --git a/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h
new file mode 100644
index 000000000000..4b69d33e4f2d
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <asm/kvm_hypevents.h>
+
+#include <linux/arm-smccc.h>
+
+#undef arm_smccc_1_1_smc
+#define arm_smccc_1_1_smc(...) \
+ do { \
+ trace_hyp_exit(); \
+ __arm_smccc_1_1(SMCCC_SMC_INST, __VA_ARGS__); \
+ trace_hyp_enter(); \
+ } while (0)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/define_events.h b/arch/arm64/kvm/hyp/include/nvhe/define_events.h
new file mode 100644
index 000000000000..3947c1e47ef4
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/nvhe/define_events.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef HYP_EVENT_FILE
+# define __HYP_EVENT_FILE <asm/kvm_hypevents.h>
+#else
+# define __HYP_EVENT_FILE __stringify(HYP_EVENT_FILE)
+#endif
+
+#undef HYP_EVENT
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ atomic_t __ro_after_init __name##_enabled = ATOMIC_INIT(0); \
+ struct hyp_event_id hyp_event_id_##__name \
+ __section(".hyp.event_ids") = { \
+ .data = (void *)&__name##_enabled, \
+ }
+
+#define HYP_EVENT_MULTI_READ
+#include __HYP_EVENT_FILE
+#undef HYP_EVENT_MULTI_READ
+
+#undef HYP_EVENT
diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h
index 29e0c7a6c867..d947ac2ebdf8 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/trace.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h
@@ -2,6 +2,7 @@
#ifndef __ARM64_KVM_HYP_NVHE_TRACE_H
#define __ARM64_KVM_HYP_NVHE_TRACE_H
#include <asm/kvm_hyptrace.h>
+#include <asm/kvm_hypevents_defs.h>
/* Internal struct that needs export for hyp-constants.c */
struct hyp_buffer_page {
@@ -15,18 +16,40 @@ struct hyp_buffer_page {
#ifdef CONFIG_TRACING
void *tracing_reserve_entry(unsigned long length);
void tracing_commit_entry(void);
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ HYP_EVENT_FORMAT(__name, __struct); \
+ extern atomic_t __name##_enabled; \
+ extern unsigned short hyp_event_id_##__name; \
+ static inline void trace_##__name(__proto) \
+ { \
+ size_t length = sizeof(struct trace_hyp_format_##__name); \
+ struct trace_hyp_format_##__name *__entry; \
+ \
+ if (!atomic_read(&__name##_enabled)) \
+ return; \
+ __entry = tracing_reserve_entry(length); \
+ if (!__entry) \
+ return; \
+ __entry->hdr.id = hyp_event_id_##__name; \
+ __assign \
+ tracing_commit_entry(); \
+ }
int __pkvm_load_tracing(unsigned long desc_va, size_t desc_size);
void __pkvm_teardown_tracing(void);
int __pkvm_enable_tracing(bool enable);
int __pkvm_swap_reader_tracing(int cpu);
+int __pkvm_enable_event(unsigned short id, bool enable);
#else
static inline void *tracing_reserve_entry(unsigned long length) { return NULL; }
static inline void tracing_commit_entry(void) { }
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+ static inline void trace_##__name(__proto) {}
static inline int __pkvm_load_tracing(unsigned long desc_va, size_t desc_size) { return -ENODEV; }
static inline void __pkvm_teardown_tracing(void) { }
static inline int __pkvm_enable_tracing(bool enable) { return -ENODEV; }
static inline int __pkvm_swap_reader_tracing(int cpu) { return -ENODEV; }
+static inline int __pkvm_enable_event(unsigned short id, bool enable) { return -ENODEV; }
#endif
#endif
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 1a5367c13407..20a36dd3a7d0 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -26,7 +26,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
-hyp-obj-$(CONFIG_TRACING) += clock.o trace.o
+hyp-obj-$(CONFIG_TRACING) += clock.o events.o trace.o
hyp-obj-y += $(lib-objs)
##
diff --git a/arch/arm64/kvm/hyp/nvhe/events.c b/arch/arm64/kvm/hyp/nvhe/events.c
new file mode 100644
index 000000000000..ad214f3f698c
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/events.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <nvhe/mm.h>
+#include <nvhe/trace.h>
+
+#include <nvhe/define_events.h>
+
+extern struct hyp_event_id __hyp_event_ids_start[];
+extern struct hyp_event_id __hyp_event_ids_end[];
+
+int __pkvm_enable_event(unsigned short id, bool enable)
+{
+ struct hyp_event_id *event_id = __hyp_event_ids_start;
+ atomic_t *enable_key;
+
+ for (; (unsigned long)event_id < (unsigned long)__hyp_event_ids_end;
+ event_id++) {
+ if (event_id->id != id)
+ continue;
+
+ enable_key = (atomic_t *)event_id->data;
+ enable_key = hyp_fixmap_map(__hyp_pa(enable_key));
+
+ atomic_set(enable_key, enable);
+
+ hyp_fixmap_unmap();
+
+ return 0;
+ }
+
+ return -EINVAL;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c
index e715c157c2c4..d17ef3771e6a 100644
--- a/arch/arm64/kvm/hyp/nvhe/ffa.c
+++ b/arch/arm64/kvm/hyp/nvhe/ffa.c
@@ -26,10 +26,10 @@
* the duration and are therefore serialised.
*/
-#include <linux/arm-smccc.h>
#include <linux/arm_ffa.h>
#include <asm/kvm_pkvm.h>
+#include <nvhe/arm-smccc.h>
#include <nvhe/ffa.h>
#include <nvhe/mem_protect.h>
#include <nvhe/memory.h>
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index d3d01c02d35c..f6119dbdcce1 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -11,6 +11,7 @@
#include <asm/kvm_emulate.h>
#include <asm/kvm_host.h>
#include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
#include <asm/kvm_mmu.h>
#include <nvhe/ffa.h>
@@ -403,6 +404,14 @@ static void handle___pkvm_swap_reader_tracing(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = __pkvm_swap_reader_tracing(cpu);
}
+static void handle___pkvm_enable_event(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(unsigned short, id, host_ctxt, 1);
+ DECLARE_REG(bool, enable, host_ctxt, 2);
+
+ cpu_reg(host_ctxt, 1) = __pkvm_enable_event(id, enable);
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -439,6 +448,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_teardown_tracing),
HANDLE_FUNC(__pkvm_enable_tracing),
HANDLE_FUNC(__pkvm_swap_reader_tracing),
+ HANDLE_FUNC(__pkvm_enable_event),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
@@ -479,7 +489,9 @@ static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
static void default_host_smc_handler(struct kvm_cpu_context *host_ctxt)
{
+ trace_hyp_exit();
__kvm_hyp_host_forward_smc(host_ctxt);
+ trace_hyp_enter();
}
static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
@@ -503,6 +515,8 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
{
u64 esr = read_sysreg_el2(SYS_ESR);
+ trace_hyp_enter();
+
switch (ESR_ELx_EC(esr)) {
case ESR_ELx_EC_HVC64:
handle_host_hcall(host_ctxt);
@@ -522,4 +536,6 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
default:
BUG();
}
+
+ trace_hyp_exit();
}
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
index f4562f417d3f..9d0ce68f1ced 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
@@ -16,6 +16,10 @@ SECTIONS {
HYP_SECTION(.text)
HYP_SECTION(.data..ro_after_init)
HYP_SECTION(.rodata)
+#ifdef CONFIG_TRACING
+ . = ALIGN(PAGE_SIZE);
+ HYP_SECTION(.event_ids)
+#endif
/*
* .hyp..data..percpu needs to be page aligned to maintain the same
diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
index dfe8fe0f7eaf..1315fb6df3a3 100644
--- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
+++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
@@ -6,11 +6,12 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
#include <asm/kvm_mmu.h>
-#include <linux/arm-smccc.h>
#include <linux/kvm_host.h>
#include <uapi/linux/psci.h>
+#include <nvhe/arm-smccc.h>
#include <nvhe/memory.h>
#include <nvhe/trap_handler.h>
@@ -153,6 +154,7 @@ static int psci_cpu_suspend(u64 func_id, struct kvm_cpu_context *host_ctxt)
DECLARE_REG(u64, power_state, host_ctxt, 1);
DECLARE_REG(unsigned long, pc, host_ctxt, 2);
DECLARE_REG(unsigned long, r0, host_ctxt, 3);
+ int ret;
struct psci_boot_args *boot_args;
struct kvm_nvhe_init_params *init_params;
@@ -171,9 +173,11 @@ static int psci_cpu_suspend(u64 func_id, struct kvm_cpu_context *host_ctxt)
* Will either return if shallow sleep state, or wake up into the entry
* point if it is a deep sleep state.
*/
- return psci_call(func_id, power_state,
- __hyp_pa(&kvm_hyp_cpu_resume),
- __hyp_pa(init_params));
+ ret = psci_call(func_id, power_state,
+ __hyp_pa(&kvm_hyp_cpu_resume),
+ __hyp_pa(init_params));
+
+ return ret;
}
static int psci_system_suspend(u64 func_id, struct kvm_cpu_context *host_ctxt)
@@ -205,6 +209,7 @@ asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on)
struct psci_boot_args *boot_args;
struct kvm_cpu_context *host_ctxt;
+ trace_hyp_enter();
host_ctxt = host_data_ptr(host_ctxt);
if (is_cpu_on)
@@ -218,6 +223,7 @@ asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on)
if (is_cpu_on)
release_boot_args(boot_args);
+ trace_hyp_exit();
__host_enter(host_ctxt);
}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 6af179c6356d..572313a5a448 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -7,7 +7,6 @@
#include <hyp/switch.h>
#include <hyp/sysreg-sr.h>
-#include <linux/arm-smccc.h>
#include <linux/kvm_host.h>
#include <linux/types.h>
#include <linux/jump_label.h>
@@ -21,6 +20,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
#include <asm/kvm_mmu.h>
#include <asm/fpsimd.h>
#include <asm/debug-monitors.h>
@@ -328,10 +328,13 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__debug_switch_to_guest(vcpu);
do {
+ trace_hyp_exit();
+
/* Jump in the fire! */
exit_code = __guest_enter(vcpu);
/* And we're baaack! */
+ trace_hyp_enter();
} while (fixup_guest_exit(vcpu, &exit_code));
__sysreg_save_state_nvhe(guest_ctxt);
diff --git a/arch/arm64/kvm/hyp_events.c b/arch/arm64/kvm/hyp_events.c
new file mode 100644
index 000000000000..5ea468b5671f
--- /dev/null
+++ b/arch/arm64/kvm/hyp_events.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 Google LLC
+ */
+
+#include <linux/tracefs.h>
+
+#include <asm/kvm_host.h>
+#include <asm/kvm_define_hypevents.h>
+#include <asm/setup.h>
+
+#include "hyp_trace.h"
+
+extern struct hyp_event __hyp_events_start[];
+extern struct hyp_event __hyp_events_end[];
+
+/* hyp_event section used by the hypervisor */
+extern struct hyp_event_id __hyp_event_ids_start[];
+extern struct hyp_event_id __hyp_event_ids_end[];
+
+static ssize_t
+hyp_event_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+ struct seq_file *seq_file = (struct seq_file *)filp->private_data;
+ struct hyp_event *evt = (struct hyp_event *)seq_file->private;
+ unsigned short id = evt->id;
+ bool enabling;
+ int ret;
+ char c;
+
+ if (!cnt || cnt > 2)
+ return -EINVAL;
+
+ if (get_user(c, ubuf))
+ return -EFAULT;
+
+ switch (c) {
+ case '1':
+ enabling = true;
+ break;
+ case '0':
+ enabling = false;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (enabling != *evt->enabled) {
+ ret = kvm_call_hyp_nvhe(__pkvm_enable_event, id, enabling);
+ if (ret)
+ return ret;
+ }
+
+ *evt->enabled = enabling;
+
+ return cnt;
+}
+
+static int hyp_event_show(struct seq_file *m, void *v)
+{
+ struct hyp_event *evt = (struct hyp_event *)m->private;
+
+ seq_printf(m, "%d\n", *evt->enabled);
+
+ return 0;
+}
+
+static int hyp_event_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, hyp_event_show, inode->i_private);
+}
+
+static const struct file_operations hyp_event_fops = {
+ .open = hyp_event_open,
+ .write = hyp_event_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int hyp_event_id_show(struct seq_file *m, void *v)
+{
+ struct hyp_event *evt = (struct hyp_event *)m->private;
+
+ seq_printf(m, "%d\n", evt->id);
+
+ return 0;
+}
+
+static int hyp_event_id_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, hyp_event_id_show, inode->i_private);
+}
+
+static const struct file_operations hyp_event_id_fops = {
+ .open = hyp_event_id_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+void hyp_trace_init_event_tracefs(struct dentry *parent)
+{
+ struct hyp_event *event = __hyp_events_start;
+
+ parent = tracefs_create_dir("events", parent);
+ if (!parent) {
+ pr_err("Failed to create tracefs folder for hyp events\n");
+ return;
+ }
+
+ parent = tracefs_create_dir("hyp", parent);
+ if (!parent) {
+ pr_err("Failed to create tracefs folder for hyp events\n");
+ return;
+ }
+
+ for (; (unsigned long)event < (unsigned long)__hyp_events_end; event++) {
+ struct dentry *event_dir = tracefs_create_dir(event->name, parent);
+
+ if (!event_dir) {
+ pr_err("Failed to create events/hyp/%s\n", event->name);
+ continue;
+ }
+
+ tracefs_create_file("enable", 0700, event_dir, (void *)event,
+ &hyp_event_fops);
+ tracefs_create_file("id", 0400, event_dir, (void *)event,
+ &hyp_event_id_fops);
+ }
+}
+
+struct hyp_event *hyp_trace_find_event(int id)
+{
+ struct hyp_event *event = __hyp_events_start + id;
+
+ if ((unsigned long)event >= (unsigned long)__hyp_events_end)
+ return NULL;
+
+ return event;
+}
+
+/*
+ * Register hyp events and write their id into the hyp section _hyp_event_ids.
+ */
+int hyp_trace_init_events(void)
+{
+ struct hyp_event_id *hyp_event_id = __hyp_event_ids_start;
+ struct hyp_event *event = __hyp_events_start;
+ int id = 0;
+
+ for (; (unsigned long)event < (unsigned long)__hyp_events_end;
+ event++, hyp_event_id++, id++) {
+
+ /*
+ * Both the host and the hypervisor rely on the same hyp event
+ * declarations from kvm_hypevents.h, hence a 1:1 mapping.
+ */
+ event->id = hyp_event_id->id = id;
+ }
+
+ return 0;
+}
diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c
index 36c120aeabc0..d3a14ac8b3dd 100644
--- a/arch/arm64/kvm/hyp_trace.c
+++ b/arch/arm64/kvm/hyp_trace.c
@@ -7,10 +7,12 @@
#include <linux/arm-smccc.h>
#include <linux/list.h>
#include <linux/percpu-defs.h>
+#include <linux/trace_events.h>
#include <linux/tracefs.h>
#include <asm/kvm_host.h>
#include <asm/kvm_hyptrace.h>
+#include <asm/kvm_hypevents_defs.h>
#include "hyp_constants.h"
#include "hyp_trace.h"
@@ -460,6 +462,8 @@ static void ht_print_trace_cpu(struct ht_iterator *iter)
static int ht_print_trace_fmt(struct ht_iterator *iter)
{
+ struct hyp_event *e;
+
if (iter->lost_events)
trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",
iter->ent_cpu, iter->lost_events);
@@ -467,6 +471,12 @@ static int ht_print_trace_fmt(struct ht_iterator *iter)
ht_print_trace_cpu(iter);
ht_print_trace_time(iter);
+ e = hyp_trace_find_event(iter->ent->id);
+ if (e)
+ e->trace_func(iter);
+ else
+ trace_seq_printf(&iter->seq, "Unknown event id %d\n", iter->ent->id);
+
return trace_seq_has_overflowed(&iter->seq) ? -EOVERFLOW : 0;
};
@@ -807,5 +817,7 @@ int hyp_trace_init_tracefs(void)
(void *)cpu, &hyp_trace_pipe_fops);
}
+ hyp_trace_init_event_tracefs(root);
+
return 0;
}
diff --git a/arch/arm64/kvm/hyp_trace.h b/arch/arm64/kvm/hyp_trace.h
index 14fc06c625a6..3ac648415bf9 100644
--- a/arch/arm64/kvm/hyp_trace.h
+++ b/arch/arm64/kvm/hyp_trace.h
@@ -3,26 +3,13 @@
#ifndef __ARM64_KVM_HYP_TRACE_H__
#define __ARM64_KVM_HYP_TRACE_H__
-#include <linux/trace_seq.h>
-#include <linux/workqueue.h>
-
-struct ht_iterator {
- struct trace_buffer *trace_buffer;
- int cpu;
- struct hyp_entry_hdr *ent;
- unsigned long lost_events;
- int ent_cpu;
- size_t ent_size;
- u64 ts;
- void *spare;
- size_t copy_leftover;
- struct trace_seq seq;
- struct delayed_work poll_work;
-};
-
#ifdef CONFIG_TRACING
int hyp_trace_init_tracefs(void);
+int hyp_trace_init_events(void);
+struct hyp_event *hyp_trace_find_event(int id);
+void hyp_trace_init_event_tracefs(struct dentry *parent);
#else
static inline int hyp_trace_init_tracefs(void) { return 0; }
+static inline int hyp_trace_init_events(void) { return 0; }
#endif
#endif
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 11/11] KVM: arm64: Add kselftest for tracefs hyp tracefs
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (9 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 10/11] KVM: arm64: Add support for hyp events Vincent Donnefort
@ 2024-08-05 17:32 ` Vincent Donnefort
2024-08-06 20:11 ` [RFC PATCH 00/11] Tracefs support for pKVM Steven Rostedt
11 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-05 17:32 UTC (permalink / raw)
To: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton
Cc: kvmarm, will, qperret, kernel-team, Vincent Donnefort
Add a test to validate the newly introduced tracefs interface for the
pKVM hypervisor. This test covers the use of extended timestamps and the
coherence of the tracing clock.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 1a576ef28b7f..4ed5fac75981 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -84,6 +84,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_enable_tracing,
__KVM_HOST_SMCCC_FUNC___pkvm_swap_reader_tracing,
__KVM_HOST_SMCCC_FUNC___pkvm_enable_event,
+ __KVM_HOST_SMCCC_FUNC___pkvm_selftest_event,
};
#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
diff --git a/arch/arm64/include/asm/kvm_hypevents.h b/arch/arm64/include/asm/kvm_hypevents.h
index 0b98a87a1250..1c797b748ff2 100644
--- a/arch/arm64/include/asm/kvm_hypevents.h
+++ b/arch/arm64/include/asm/kvm_hypevents.h
@@ -28,4 +28,14 @@ HYP_EVENT(hyp_exit,
),
HE_PRINTK(" ")
);
+
+#ifdef CONFIG_PROTECTED_NVHE_TESTING
+HYP_EVENT(selftest,
+ HE_PROTO(void),
+ HE_STRUCT(),
+ HE_ASSIGN(),
+ HE_PRINTK(" ")
+);
#endif
+
+#endif /* __ARM64_KVM_HYPEVENTS_H_ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..f3906cb733fe 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -65,4 +65,13 @@ config PROTECTED_NVHE_STACKTRACE
If unsure, or not using protected nVHE (pKVM), say N.
+config PROTECTED_NVHE_TESTING
+ bool "Protected KVM hypervisor testing infrastructure"
+ depends on KVM
+ default n
+ help
+ Say Y here to enable pKVM hypervisor testing infrastructure.
+
+ If unsure, say N.
+
endif # VIRTUALIZATION
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index f6119dbdcce1..f9991ba9744a 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -412,6 +412,19 @@ static void handle___pkvm_enable_event(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = __pkvm_enable_event(id, enable);
}
+static void handle___pkvm_selftest_event(struct kvm_cpu_context *host_ctxt)
+{
+ int smc_ret = SMCCC_RET_NOT_SUPPORTED, ret = -EOPNOTSUPP;
+
+#ifdef CONFIG_PROTECTED_NVHE_TESTING
+ trace_selftest();
+ smc_ret = SMCCC_RET_SUCCESS;
+ ret = 0;
+#endif
+ cpu_reg(host_ctxt, 0) = smc_ret;
+ cpu_reg(host_ctxt, 1) = ret;
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);
#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -449,6 +462,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_enable_tracing),
HANDLE_FUNC(__pkvm_swap_reader_tracing),
HANDLE_FUNC(__pkvm_enable_event),
+ HANDLE_FUNC(__pkvm_selftest_event),
};
static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c
index d3a14ac8b3dd..143c7edf7d17 100644
--- a/arch/arm64/kvm/hyp_trace.c
+++ b/arch/arm64/kvm/hyp_trace.c
@@ -765,6 +765,36 @@ static const struct file_operations hyp_trace_fops = {
.release = NULL,
};
+#ifdef CONFIG_PROTECTED_NVHE_TESTING
+static int selftest_event_open(struct inode *inode, struct file *file)
+{
+ if (file->f_mode & FMODE_WRITE)
+ return kvm_call_hyp_nvhe(__pkvm_selftest_event);
+
+ return 0;
+}
+
+static ssize_t selftest_event_write(struct file *f, const char __user *buf,
+ size_t cnt, loff_t *pos)
+{
+ return cnt;
+}
+
+static const struct file_operations selftest_event_fops = {
+ .open = selftest_event_open,
+ .write = selftest_event_write,
+ .llseek = no_llseek,
+};
+
+static void hyp_trace_init_testing_tracefs(struct dentry *root)
+{
+ tracefs_create_file("selftest_event", TRACEFS_MODE_WRITE, root, NULL,
+ &selftest_event_fops);
+}
+#else
+static void hyp_trace_init_testing_tracefs(struct dentry *root) { }
+#endif
+
int hyp_trace_init_tracefs(void)
{
struct dentry *root, *per_cpu_root;
@@ -818,6 +848,7 @@ int hyp_trace_init_tracefs(void)
}
hyp_trace_init_event_tracefs(root);
+ hyp_trace_init_testing_tracefs(root);
return 0;
}
diff --git a/tools/testing/selftests/hyp-trace/Makefile b/tools/testing/selftests/hyp-trace/Makefile
new file mode 100644
index 000000000000..2a5b2e29667e
--- /dev/null
+++ b/tools/testing/selftests/hyp-trace/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+TEST_PROGS := hyp-trace-test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/hyp-trace/config b/tools/testing/selftests/hyp-trace/config
new file mode 100644
index 000000000000..39cee8ec30fa
--- /dev/null
+++ b/tools/testing/selftests/hyp-trace/config
@@ -0,0 +1,4 @@
+CONFIG_FTRACE=y
+CONFIG_ARM64=y
+CONFIG_KVM=y
+CONFIG_PROTECTED_NVHE_TESTING=y
diff --git a/tools/testing/selftests/hyp-trace/hyp-trace-test b/tools/testing/selftests/hyp-trace/hyp-trace-test
new file mode 100644
index 000000000000..7f010fba385a
--- /dev/null
+++ b/tools/testing/selftests/hyp-trace/hyp-trace-test
@@ -0,0 +1,161 @@
+#!/bin/sh -e
+# SPDX-License-Identifier: GPL-2.0-only
+
+# hyp-trace-test - Tracefs for pKVM hypervisor test
+#
+# Copyright (C) 2024 - Google LLC
+# Author: Vincent Donnefort <vdonnefort@google.com>
+#
+
+log_and_die()
+{
+ echo "$1"
+
+ exit 1
+}
+
+host_clock()
+{
+ # BOOTTIME clock
+ awk '{print $1}' /proc/uptime
+}
+
+goto_hyp_trace()
+{
+ if [ -d "/sys/kernel/debug/tracing/hyp" ]; then
+ cd /sys/kernel/debug/tracing/hyp
+ return
+ fi
+
+ if [ -d "/sys/kernel/tracing/hyp" ]; then
+ cd /sys/kernel/tracing/hyp
+ return
+ fi
+
+ echo "ERROR: hyp tracing folder not found!"
+
+ exit 1
+}
+
+reset_hyp_trace()
+{
+ echo 0 > tracing_on
+ echo 0 > trace
+ for event in events/hyp/*; do
+ echo 0 > $event/enable
+ done
+}
+
+setup_hyp_trace()
+{
+ goto_hyp_trace
+
+ reset_hyp_trace
+
+ echo 16 > buffer_size_kb
+ echo 1 > events/hyp/selftest/enable
+ echo 1 > tracing_on
+}
+
+stop_hyp_trace()
+{
+ echo 0 > tracing_on
+}
+
+write_events()
+{
+ local num="$1"
+ local func="$2"
+
+ for i in $(seq 1 $num); do
+ echo 1 > selftest_event
+ [ -z "$func" -o $i -eq $num ] || eval $func
+ done
+}
+
+consuming_read()
+{
+ local output=$1
+
+ nohup cat trace_pipe > $output &
+
+ echo $!
+}
+
+run_test_consuming()
+{
+ local nr_events=$1
+ local func=$2
+ local tmp="$(mktemp)"
+ local start_ts=0
+ local end_ts=0
+ local pid=0
+
+ echo "Output trace file: $tmp"
+
+ setup_hyp_trace
+ pid=$(consuming_read $tmp)
+
+ start_ts=$(host_clock)
+ write_events $nr_events $func
+ stop_hyp_trace
+ end_ts=$(host_clock)
+
+ kill $pid
+ validate_test $tmp $nr_events $start_ts $end_ts
+
+ rm $tmp
+}
+
+validate_test()
+{
+ local output=$1
+ local expected_events=$2
+ local start_ts=$3
+ local end_ts=$4
+ local prev_ts=$3
+ local ts=0
+ local num_events=0
+
+ IFS=$'\n'
+ for line in $(cat $output); do
+ echo "$line" | grep -q -E "^# " && continue
+ ts=$(echo "$line" | awk '{print $2}' | cut -d ':' -f1)
+ if [ $(echo "$ts<$prev_ts" | bc) -eq 1 ]; then
+ log_and_die "Error event @$ts < $prev_ts"
+ fi
+ prev_ts=$ts
+ num_events=$((num_events + 1))
+ done
+
+ if [ $(echo "$ts>$end_ts" | bc) -eq 1 ]; then
+ log_and_die "Error event @$ts > $end_ts"
+ fi
+
+ if [ $num_events -ne $expected_events ]; then
+ log_and_die "Expected $expected_events events, got $num_events"
+ fi
+}
+
+test_base()
+{
+ echo "Test consuming read..."
+
+ run_test_consuming 1000
+
+ echo "done."
+}
+
+test_extended_ts()
+{
+ echo "Test Extended timestamp..."
+
+ run_test_consuming 1000 "sleep 0.1"
+
+ echo "done."
+}
+
+test_base
+test_extended_ts
+
+exit 0
--
2.46.0.rc2.264.g509ed76dc8-goog
* Re: [RFC PATCH 00/11] Tracefs support for pKVM
2024-08-05 17:32 [RFC PATCH 00/11] Tracefs support for pKVM Vincent Donnefort
` (10 preceding siblings ...)
2024-08-05 17:32 ` [RFC PATCH 11/11] KVM: arm64: Add kselftest for tracefs hyp tracefs Vincent Donnefort
@ 2024-08-06 20:11 ` Steven Rostedt
2024-08-07 16:39 ` Vincent Donnefort
11 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2024-08-06 20:11 UTC (permalink / raw)
To: Vincent Donnefort
Cc: mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm, will,
qperret, kernel-team
Hi Vincent,
Thanks for sending this!
On Mon, 5 Aug 2024 18:32:23 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:
> The growing set of features supported by the hypervisor in protected
> mode necessitates debugging and profiling tools. Tracefs is the
> ideal candidate for this task:
>
> * It is simple to use and to script.
>
> * It is supported by various tools, from the trace-cmd CLI to the
> Android web-based perfetto.
>
> * The ring-buffer, where are stored trace events consists of linked
> pages, making it an ideal structure for sharing between kernel and
> hypervisor.
>
> This series introduces a method to create events and to generate them
> from the hypervisor (hyp_enter/hyp_exit given as an example) as well as
> a Tracefs user-space interface to read them.
>
> A presentation was given on this matter during the tracing summit in
> 2022. [1]
>
> 1. ring-buffer
> --------------
>
> To setup the per-cpu ring-buffers, a new interface is created:
>
> ring_buffer_writer: Describes what the kernel needs to know about the
> writer, that is, the set of pages forming the
> ring-buffer and a callback for the reader/head
> swapping (enables consuming read)
>
> ring_buffer_reader(): Creates a read-only ring-buffer from a
> ring_buffer_writer.
>
> To keep the internals of `struct ring_buffer` in sync with the writer,
> the meta-page is used. It was originally introduced to enable user-space
> mapping of the ring-buffer [1]. In this case, the kernel is not the
> producer anymore but the reader. The function to read that meta-page is:
>
> ring_buffer_poll_writer():
> Update `struct ring_buffer` based on the writer
> meta-page. Wake-up readers if necessary.
>
> The kernel has to poll the meta-page to be notified of newly written
> events.
>
> 2. Tracefs interface
> --------------------
>
> The interface is a hyp/ folder at the root of the tracefs mount point.
> This folder is like an instance and you'll find there a subset of the
> regular Tracefs user-space interface:
>
> hyp/
Hmm, do we really need to shorten it? Why not just call it "hypervisor". I
mean tab completion helps with the typing.
> buffer_size_kb
> trace_pipe
> trace_pipe_raw
> trace
> per_cpu/
> cpuX/
> trace_pipe
> trace_pipe_raw
> events/
> hyp/
> hyp_enter/
> enable
> id
>
> Behind the scenes, kvm/hyp_trace.c must rebuild the tracing hierarchy
> without relying on kernel/trace/trace.c. This is due to fundamental
> differences:
>
> * Hypervisor tracing doesn't support trace_array's system-specific
> features (snapshots, tracers, etc.).
>
> * Logged event formats differ (e.g., no PID in hypervisor
> events).
>
> * Buffer operations require specific hypervisor interactions.
>
> 3. Events
> ---------
>
> In the hypervisor, "hyp events" can be generated with trace_<event_name>
> in a similar fashion to what the kernel does. They're also created with
> similar macros than the kernel (see kvm_hypevents.h)
>
> HYP_EVENT("foboar",
> HE_PROTO(void),
> HE_STRUCT(),
> HE_ASSIGN(),
> HE_PRINTK(" ")
> )
>
> Despite the apparent similarities with TRACE_EVENT(), those macros
> internally differs: they must be used in parallel between the hypervisor
> (for the writing part) and the kernel (for the reading part) which makes
> it difficult to share anything with their kernel counterpart.
>
> Also, events directory isn't using eventfs.
>
> 4. Few limitations:
> -------------------
>
> Non consuming reading of the buffer isn't supported (i.e. cat trace) due
> to the lack of support in the ring-buffer meta-page.
Hmm, I don't think it should be hard to support that. I've been looking
into it for the user mapping. But that can be added later. For now, perhaps
"cat trace" just returns -EPERM?
>
> Tracing must be stopped for the buffer to be reset. i.e. (echo 0 >
> tracing_on; echo 0 > trace)
Hmm, why this? I haven't looked at the patches yet, but why can't the
write to trace just stop tracing and re-enable it after the reset?
>
-- Steve
* Re: [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer
2024-08-05 17:32 ` [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer Vincent Donnefort
@ 2024-08-06 20:39 ` Steven Rostedt
2024-08-13 14:21 ` Vincent Donnefort
0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2024-08-06 20:39 UTC (permalink / raw)
To: Vincent Donnefort
Cc: mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm, will,
qperret, kernel-team
On Mon, 5 Aug 2024 18:32:25 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:
> +
> +#define for_each_rb_page_desc(__pdesc, __cpu, __trace_pdesc) \
> + for (__pdesc = (struct rb_page_desc *)&((__trace_pdesc)->__data[0]), __cpu = 0; \
> + __cpu < (__trace_pdesc)->nr_cpus; \
> + __cpu++, __pdesc = __next_rb_page_desc(__pdesc))
> +
> +static inline
> +struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc, int cpu)
> +{
> + struct rb_page_desc *pdesc;
> + int i;
> +
> + if (!trace_pdesc)
> + return NULL;
> +
> + for_each_rb_page_desc(pdesc, i, trace_pdesc) {
> + if (pdesc->cpu == cpu)
Is there a reason for the linear search?
Why not just:
if (cpu >= trace_pdesc->nr_cpus)
return NULL;
len = struct_size(pdesc, page_va, pdesc->nr_page_va);
pdesc = (void *)&(trace_pdesc->__data[0]);
return pdesc + len * cpu;
(note I don't think you need to typecast the void pointer).
> + return pdesc;
> + }
> +
> + return NULL;
> +}
> +
> +static inline
> +void *rb_page_desc_page(struct rb_page_desc *pdesc, int page_id)
> +{
> + return page_id > pdesc->nr_page_va ? NULL : (void *)pdesc->page_va[page_id];
> +}
> +
> +struct ring_buffer_writer {
> + struct trace_page_desc *pdesc;
> + int (*get_reader_page)(int cpu);
> +};
> +
> +int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu);
> +
> +#define ring_buffer_reader(writer) \
> +({ \
> + static struct lock_class_key __key; \
> + __ring_buffer_alloc(0, RB_FL_OVERWRITE, &__key, writer);\
> +})
> #endif /* _LINUX_RING_BUFFER_H */
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index f4f4dda28077..a07c22254cfd 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -495,6 +495,8 @@ struct ring_buffer_per_cpu {
> unsigned long *subbuf_ids; /* ID to subbuf VA */
> struct trace_buffer_meta *meta_page;
>
> + struct ring_buffer_writer *writer;
> +
> /* ring buffer pages to update, > 0 to add, < 0 to remove */
> long nr_pages_to_update;
> struct list_head new_pages; /* new pages to add */
> @@ -517,6 +519,8 @@ struct trace_buffer {
>
> struct ring_buffer_per_cpu **buffers;
>
> + struct ring_buffer_writer *writer;
> +
> struct hlist_node node;
> u64 (*clock)(void);
>
> @@ -1626,6 +1630,31 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>
> cpu_buffer->reader_page = bpage;
>
> + if (buffer->writer) {
> + struct rb_page_desc *pdesc = rb_page_desc(buffer->writer->pdesc, cpu);
> +
> + if (!pdesc)
> + goto fail_free_reader;
> +
> + cpu_buffer->writer = buffer->writer;
> + cpu_buffer->meta_page = (struct trace_buffer_meta *)(void *)pdesc->meta_va;
> + cpu_buffer->subbuf_ids = pdesc->page_va;
> + cpu_buffer->nr_pages = pdesc->nr_page_va - 1;
> + atomic_inc(&cpu_buffer->record_disabled);
> + atomic_inc(&cpu_buffer->resize_disabled);
> +
> + bpage->page = rb_page_desc_page(pdesc,
> + cpu_buffer->meta_page->reader.id);
> + if (!bpage->page)
> + goto fail_free_reader;
> + /*
> + * The meta-page can only describe which of the ring-buffer page
> + * is the reader. There is no need to init the rest of the
> + * ring-buffer.
> + */
> + return cpu_buffer;
> + }
> +
> page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
> cpu_buffer->buffer->subbuf_order);
> if (!page)
> @@ -1663,6 +1692,10 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
>
> irq_work_sync(&cpu_buffer->irq_work.work);
>
> + if (cpu_buffer->writer)
> + /* the ring_buffer doesn't own the data pages */
> + cpu_buffer->reader_page->page = NULL;
Note, if statements are to have brackets if it's more than one line. That
even includes comments. So the above should be written either as:
if (cpu_buffer->writer) {
/* the ring_buffer doesn't own the data pages */
cpu_buffer->reader_page->page = NULL;
}
Or
/* the ring_buffer doesn't own the data pages */
if (cpu_buffer->writer)
cpu_buffer->reader_page->page = NULL;
For the second version, you should probably add more detail:
/* ring_buffers with writer set do not own the data pages */
if (cpu_buffer->writer)
cpu_buffer->reader_page->page = NULL;
> +
> free_buffer_page(cpu_buffer->reader_page);
>
> if (head) {
> @@ -1693,7 +1726,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
> * drop data when the tail hits the head.
> */
> struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
> - struct lock_class_key *key)
> + struct lock_class_key *key,
> + struct ring_buffer_writer *writer)
> {
> struct trace_buffer *buffer;
> long nr_pages;
> @@ -1721,6 +1755,10 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
> buffer->flags = flags;
> buffer->clock = trace_clock_local;
> buffer->reader_lock_key = key;
> + if (writer) {
> + buffer->writer = writer;
Should probably add a comment here:
/* The writer is external and never done by the kernel */
or something like that.
> + atomic_inc(&buffer->record_disabled);
> + }
>
-- Steve
> init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
> init_waitqueue_head(&buffer->irq_work.waiters);
> @@ -4468,8 +4506,54 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
> }
> }
>
> +static bool rb_read_writer_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + local_set(&cpu_buffer->entries, READ_ONCE(cpu_buffer->meta_page->entries));
> + local_set(&cpu_buffer->overrun, READ_ONCE(cpu_buffer->meta_page->overrun));
> + local_set(&cpu_buffer->pages_touched, READ_ONCE(meta_pages_touched(cpu_buffer->meta_page)));
> + local_set(&cpu_buffer->pages_lost, READ_ONCE(meta_pages_lost(cpu_buffer->meta_page)));
> + /*
> + * No need to get the "read" field, it can be tracked here as any
> + * reader will have to go through a rign_buffer_per_cpu.
> + */
> +
> + return rb_num_of_entries(cpu_buffer);
> +}
> +
> +static struct buffer_page *
> +__rb_get_reader_page_from_writer(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + u32 prev_reader;
> +
> + if (!rb_read_writer_meta_page(cpu_buffer))
> + return NULL;
> +
> + /* More to read on the reader page */
> + if (cpu_buffer->reader_page->read < rb_page_size(cpu_buffer->reader_page))
> + return cpu_buffer->reader_page;
> +
> + prev_reader = cpu_buffer->meta_page->reader.id;
> +
> + WARN_ON(cpu_buffer->writer->get_reader_page(cpu_buffer->cpu));
> + /* nr_pages doesn't include the reader page */
> + if (cpu_buffer->meta_page->reader.id > cpu_buffer->nr_pages) {
> + WARN_ON(1);
> + return NULL;
> + }
> +
> + cpu_buffer->reader_page->page =
> + (void *)cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
> + cpu_buffer->reader_page->read = 0;
> + cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
> + cpu_buffer->lost_events = cpu_buffer->meta_page->reader.lost_events;
> +
> + WARN_ON(prev_reader == cpu_buffer->meta_page->reader.id);
> +
> + return cpu_buffer->reader_page;
> +}
> +
> static struct buffer_page *
> -rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> +__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> {
> struct buffer_page *reader = NULL;
> unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
> @@ -4636,6 +4720,13 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> return reader;
> }
>
> +static struct buffer_page *
> +rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + return cpu_buffer->writer ? __rb_get_reader_page_from_writer(cpu_buffer) :
> + __rb_get_reader_page(cpu_buffer);
> +}
> +
> static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
> {
> struct ring_buffer_event *event;
> @@ -5040,7 +5131,7 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
> struct ring_buffer_per_cpu *cpu_buffer;
> struct ring_buffer_iter *iter;
>
> - if (!cpumask_test_cpu(cpu, buffer->cpumask))
> + if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
> return NULL;
>
> iter = kzalloc(sizeof(*iter), flags);
> @@ -5210,6 +5301,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
> {
> struct buffer_page *page;
>
> + if (cpu_buffer->writer)
> + return;
> +
> rb_head_page_deactivate(cpu_buffer);
>
> cpu_buffer->head_page
> @@ -5440,6 +5534,49 @@ bool ring_buffer_empty_cpu(struct trace_buffer *buffer, int cpu)
> }
> EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
>
> +int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu)
> +{
> + struct ring_buffer_per_cpu *cpu_buffer;
> + unsigned long flags;
> +
> + if (cpu != RING_BUFFER_ALL_CPUS) {
> + if (!cpumask_test_cpu(cpu, buffer->cpumask))
> + return -EINVAL;
> +
> + cpu_buffer = buffer->buffers[cpu];
> +
> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> + if (rb_read_writer_meta_page(cpu_buffer))
> + rb_wakeups(buffer, cpu_buffer);
> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +
> + return 0;
> + }
> +
> + /*
> + * Make sure all the ring buffers are up to date before we start reading
> + * them.
> + */
> + for_each_buffer_cpu(buffer, cpu) {
> + cpu_buffer = buffer->buffers[cpu];
> +
> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> + rb_read_writer_meta_page(buffer->buffers[cpu]);
> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> + }
> +
> + for_each_buffer_cpu(buffer, cpu) {
> + cpu_buffer = buffer->buffers[cpu];
> +
> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> + if (rb_num_of_entries(cpu_buffer))
> + rb_wakeups(buffer, buffer->buffers[cpu]);
> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> + }
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
> /**
> * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
> @@ -5691,6 +5828,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> unsigned int commit;
> unsigned int read;
> u64 save_timestamp;
> + bool force_memcpy;
> int ret = -1;
>
> if (!cpumask_test_cpu(cpu, buffer->cpumask))
> @@ -5728,6 +5866,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> /* Check if any events were dropped */
> missed_events = cpu_buffer->lost_events;
>
> + force_memcpy = cpu_buffer->mapped || cpu_buffer->writer;
> +
> /*
> * If this page has been partially read or
> * if len is not big enough to read the rest of the page or
> @@ -5737,7 +5877,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> */
> if (read || (len < (commit - read)) ||
> cpu_buffer->reader_page == cpu_buffer->commit_page ||
> - cpu_buffer->mapped) {
> + force_memcpy) {
> struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
> unsigned int rpos = read;
> unsigned int pos = 0;
> @@ -6290,7 +6430,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> unsigned long flags, *subbuf_ids;
> int err = 0;
>
> - if (!cpumask_test_cpu(cpu, buffer->cpumask))
> + if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
> return -EINVAL;
>
> cpu_buffer = buffer->buffers[cpu];
* Re: [RFC PATCH 00/11] Tracefs support for pKVM
2024-08-06 20:11 ` [RFC PATCH 00/11] Tracefs support for pKVM Steven Rostedt
@ 2024-08-07 16:39 ` Vincent Donnefort
0 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-07 16:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm, will,
qperret, kernel-team
On Tue, Aug 06, 2024 at 04:11:38PM -0400, Steven Rostedt wrote:
>
> Hi Vincent,
>
> Thanks for sending this!
And thanks for already having a look at it!
[..]
> > 2. Tracefs interface
> > --------------------
> >
> > The interface is a hyp/ folder at the root of the tracefs mount point.
> > This folder is like an instance and you'll find there a subset of the
> > regular Tracefs user-space interface:
> >
> > hyp/
>
> Hmm, do we really need to shorten it? Why not just call it "hypervisor". I
> mean tab completion helps with the typing.
In most of the code we refer to hyp; that's why we kept the naming here too.
But yeah, we could expand it.
>
> > buffer_size_kb
> > trace_pipe
> > trace_pipe_raw
> > trace
> > per_cpu/
> > cpuX/
> > trace_pipe
> > trace_pipe_raw
> > events/
> > hyp/
> > hyp_enter/
> > enable
> > id
> >
[...]
> >
> > 4. Few limitations:
> > -------------------
> >
> > Non consuming reading of the buffer isn't supported (i.e. cat trace) due
> > to the lack of support in the ring-buffer meta-page.
>
> Hmm, I don't think it should be hard to support that. I've been looking
> into it for the user mapping. But that can be added later. For now, perhaps
> "cat trace" just returns -EPERM?
Yeah, I am sure that's something we can make work, but it's definitely not a
priority: I believe it is less reliable than trace_pipe and unused by user-space
tools. For now we print "Not supported yet", but happy to change it to return
-EPERM.
>
> >
> > Tracing must be stopped for the buffer to be reset. i.e. (echo 0 >
> > tracing_on; echo 0 > trace)
>
> Hmm, why this? I haven't looked at the patches yet, but why can't the
> write to trace just stop tracing and re-enable it after the reset?
I could reset the buffers from the hypervisor with a dedicated hypercall.
However, I'd still need a way to tear down the buffers, that is, to unshare them
with the hypervisor and free the allocated memory. Using that reset for this
purpose was convenient, even though it implied stopping tracing first.
Perhaps `echo 0 > trace` could reset the buffers if tracing_on=1 and tear them
down if tracing_on=0?
Alternatively, I could use `echo 0 > buffer_size_kb` for the teardown, but I
prefer the former solution: interface users are more likely to just write 0 to
tracing_on and trace.
>
> >
>
> -- Steve
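A minimal sketch of the `echo 0 > trace` semantics floated above, assuming a
per-buffer tracing_on flag; struct hyp_trace_buffer, __pkvm_reset_tracing and
hyp_trace_teardown() are illustrative names rather than code from this series:

        /*
         * Illustrative only: a write of 0 to "trace" asks the hypervisor to
         * reset the buffers while tracing is still on, and tears them down
         * (unshare the pages, free the memory) once tracing is off.
         */
        static int hyp_trace_clear(struct hyp_trace_buffer *hyp_buf)
        {
                if (hyp_buf->tracing_on)
                        return kvm_call_hyp_nvhe(__pkvm_reset_tracing);

                return hyp_trace_teardown(hyp_buf);
        }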
* Re: [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer
2024-08-06 20:39 ` Steven Rostedt
@ 2024-08-13 14:21 ` Vincent Donnefort
2024-08-13 14:35 ` Steven Rostedt
0 siblings, 1 reply; 22+ messages in thread
From: Vincent Donnefort @ 2024-08-13 14:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm, will,
qperret, kernel-team
Sorry, I was preempted before I could answer here:
On Tue, Aug 06, 2024 at 04:39:53PM -0400, Steven Rostedt wrote:
> On Mon, 5 Aug 2024 18:32:25 +0100
> Vincent Donnefort <vdonnefort@google.com> wrote:
>
> > +
> > +#define for_each_rb_page_desc(__pdesc, __cpu, __trace_pdesc) \
> > + for (__pdesc = (struct rb_page_desc *)&((__trace_pdesc)->__data[0]), __cpu = 0; \
> > + __cpu < (__trace_pdesc)->nr_cpus; \
> > + __cpu++, __pdesc = __next_rb_page_desc(__pdesc))
> > +
> > +static inline
> > +struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc, int cpu)
> > +{
> > + struct rb_page_desc *pdesc;
> > + int i;
> > +
> > + if (!trace_pdesc)
> > + return NULL;
> > +
> > + for_each_rb_page_desc(pdesc, i, trace_pdesc) {
> > + if (pdesc->cpu == cpu)
>
> Is there a reason for the linear search?
>
> Why not just:
>
> if (cpu >= trace_pdesc->nr_cpus)
> return NULL;
>
> len = struct_size(pdesc, page_va, pdesc->nr_page_va);
> pdesc = (void *)&(trace_pdesc->__data[0]);
>
> return pdesc + len * cpu;
>
> (note I don't think you need to typecast the void pointer).
I assumed we can't rely on buffers being allocated for every CPU, hence the
need to look at each descriptor.
[...]
* Re: [RFC PATCH 02/11] ring-buffer: Introducing ring-buffer writer
2024-08-13 14:21 ` Vincent Donnefort
@ 2024-08-13 14:35 ` Steven Rostedt
0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2024-08-13 14:35 UTC (permalink / raw)
To: Vincent Donnefort
Cc: mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm, will,
qperret, kernel-team
On Tue, 13 Aug 2024 15:21:54 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:
> > > +
> > > +#define for_each_rb_page_desc(__pdesc, __cpu, __trace_pdesc) \
> > > + for (__pdesc = (struct rb_page_desc *)&((__trace_pdesc)->__data[0]), __cpu = 0; \
> > > + __cpu < (__trace_pdesc)->nr_cpus; \
> > > + __cpu++, __pdesc = __next_rb_page_desc(__pdesc))
> > > +
> > > +static inline
> > > +struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc, int cpu)
> > > +{
> > > + struct rb_page_desc *pdesc;
> > > + int i;
> > > +
> > > + if (!trace_pdesc)
> > > + return NULL;
> > > +
> > > + for_each_rb_page_desc(pdesc, i, trace_pdesc) {
> > > + if (pdesc->cpu == cpu)
> >
> > Is there a reason for the linear search?
> >
> > Why not just:
> >
> > if (cpu >= trace_pdesc->nr_cpus)
> > return NULL;
> >
> > len = struct_size(pdesc, page_va, pdesc->nr_page_va);
> > pdesc = (void *)&(trace_pdesc->__data[0]);
> >
> > return pdesc + len * cpu;
> >
> > (note I don't think you need to typecast the void pointer).
>
> I supposed we can't assume buffers will be allocated for each CPU, hence the
> need to look at each buffer.
OK, but by default that should be the fast path. We could add the above
and then do:
pdesc += len * cpu;
if (pdesc->cpu == cpu)
return pdesc;
/* Missing CPUs, need to do a linear search */
for_each_rb_page_desc(pdesc, i, trace_pdesc) {
if (pdesc->cpu == cpu)
[...]
-- Steve
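Putting the two replies together, a hedged sketch of what the lookup could look
like, assuming every descriptor has the same nr_page_va so entry `cpu` sits at a
fixed offset, and keeping the linear search as a fallback for missing CPUs
(illustration only, not code from the series):

        static inline
        struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc, int cpu)
        {
                struct rb_page_desc *pdesc;
                size_t len;
                int i;

                if (!trace_pdesc || cpu >= trace_pdesc->nr_cpus)
                        return NULL;

                /*
                 * Fast path: assume the descriptors all have the same size and
                 * are laid out in CPU order.
                 */
                pdesc = (struct rb_page_desc *)&trace_pdesc->__data[0];
                len = struct_size(pdesc, page_va, pdesc->nr_page_va);
                pdesc = (void *)pdesc + len * cpu;
                if (pdesc->cpu == cpu)
                        return pdesc;

                /* Missing CPUs: fall back to the linear search */
                for_each_rb_page_desc(pdesc, i, trace_pdesc) {
                        if (pdesc->cpu == cpu)
                                return pdesc;
                }

                return NULL;
        }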
* Re: [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-05 17:32 ` [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots Vincent Donnefort
@ 2024-08-22 9:13 ` Marc Zyngier
2024-09-05 13:04 ` Vincent Donnefort
2024-08-22 18:13 ` John Stultz
1 sibling, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2024-08-22 9:13 UTC (permalink / raw)
To: Vincent Donnefort
Cc: rostedt, mhiramat, linux-trace-kernel, oliver.upton, kvmarm, will,
qperret, kernel-team
On Mon, 05 Aug 2024 18:32:27 +0100,
Vincent Donnefort <vdonnefort@google.com> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.
When you say "arch_timer" here, are you talking about the data
structure describing the timer? Or about the actual *counter*, a
system register provided by the HW?
I'm not sure the architecture-specific details are massively relevant,
given that this is an arch-agnostic change.
>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.
Isn't that the actual description of the change? By getting the boot
time as well as the parameters to compute an increment, you allow any
subsystem able to perform a snapshot to compute a delta from boot time
as long as they have access to the counter source.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
> * counter value
> * @cycles: Clocksource counter value to produce the system times
> * @real: Realtime system time
> + * @boot: Boot time
> * @raw: Monotonic raw system time
> * @cs_id: Clocksource ID
> * @clock_was_set_seq: The sequence number of clock-was-set events
> * @cs_was_changed_seq: The sequence number of clocksource change events
> + * @mono_shift: The monotonic clock slope shift
> + * @mono_mult: The monotonic clock slope mult
> */
> struct system_time_snapshot {
> u64 cycles;
> ktime_t real;
> + ktime_t boot;
> ktime_t raw;
> enum clocksource_ids cs_id;
> unsigned int clock_was_set_seq;
> u8 cs_was_changed_seq;
> + u32 mono_shift;
> + u32 mono_mult;
> };
>
> /**
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 2fa87dcfeda9..6d0488a555a7 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1057,9 +1057,11 @@ noinstr time64_t __ktime_get_real_seconds(void)
> void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> {
> struct timekeeper *tk = &tk_core.timekeeper;
> + u32 mono_mult, mono_shift;
> unsigned int seq;
> ktime_t base_raw;
> ktime_t base_real;
> + ktime_t base_boot;
> u64 nsec_raw;
> u64 nsec_real;
> u64 now;
> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
> base_real = ktime_add(tk->tkr_mono.base,
> tk_core.timekeeper.offs_real);
> + base_boot = ktime_add(tk->tkr_mono.base,
> + tk_core.timekeeper.offs_boot);
> base_raw = tk->tkr_raw.base;
> nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> + mono_mult = tk->tkr_mono.mult;
> + mono_shift = tk->tkr_mono.shift;
> } while (read_seqcount_retry(&tk_core.seq, seq));
>
> systime_snapshot->cycles = now;
> systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> + systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
> systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> + systime_snapshot->mono_shift = mono_shift;
> + systime_snapshot->mono_mult = mono_mult;
> }
> EXPORT_SYMBOL_GPL(ktime_get_snapshot);
>
This looks good to me, but you should probably Cc the timekeeping
maintainers (tglx, John Stultz, and Stephen Boyd).
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-05 17:32 ` [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots Vincent Donnefort
2024-08-22 9:13 ` Marc Zyngier
@ 2024-08-22 18:13 ` John Stultz
2024-08-22 21:41 ` Thomas Gleixner
2024-09-05 13:17 ` Vincent Donnefort
1 sibling, 2 replies; 22+ messages in thread
From: John Stultz @ 2024-08-22 18:13 UTC (permalink / raw)
To: Vincent Donnefort
Cc: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm,
will, qperret, kernel-team, Thomas Gleixner
On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
<kernel-team@android.com> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.
>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.
>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
> * counter value
> * @cycles: Clocksource counter value to produce the system times
> * @real: Realtime system time
> + * @boot: Boot time
So, adding the boottime to this kernel-internal snapshot seems reasonable to me.
> * @raw: Monotonic raw system time
> * @cs_id: Clocksource ID
> * @clock_was_set_seq: The sequence number of clock-was-set events
> * @cs_was_changed_seq: The sequence number of clocksource change events
> + * @mono_shift: The monotonic clock slope shift
> + * @mono_mult: The monotonic clock slope mult
This bit, including the mult/shift pair, however, isn't well explained
and is a little more worrying.
> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
> base_real = ktime_add(tk->tkr_mono.base,
> tk_core.timekeeper.offs_real);
> + base_boot = ktime_add(tk->tkr_mono.base,
> + tk_core.timekeeper.offs_boot);
> base_raw = tk->tkr_raw.base;
> nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> + mono_mult = tk->tkr_mono.mult;
> + mono_shift = tk->tkr_mono.shift;
> } while (read_seqcount_retry(&tk_core.seq, seq));
>
> systime_snapshot->cycles = now;
> systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> + systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
> systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> + systime_snapshot->mono_shift = mono_shift;
> + systime_snapshot->mono_mult = mono_mult;
> }
> EXPORT_SYMBOL_GPL(ktime_get_snapshot);
So this looks like you're trying to stuff kernel timekeeping internal
values into the snapshot so you can skirt around the timekeeping
subsystem and generate your own timestamps.
This ends up duplicating logic, but in an incomplete way. For
instance, you don't have things like NTP state, etc., so the timestamps
you generate will not exactly match the kernel's, and may have
discontinuities. :(
Now, for many cases "close enough" is fine. But the difficulty is that the
expectation bar always rises, and eventually "close enough" isn't, and
we have a broken interface that has to be fixed.
That said, I do get that the need for something like this is
legitimate. There have been a number of cases where external hardware
(PTP timestamps from NICs) or contexts (virt) are able to record
hardware clocksource timestamps on their own, and want to be able to
map that back to the kernel's (or maybe "a kernel's" if there are
multiple VMs) sense of time. Sometimes even wanting to do this quite
a bit later after the timestamp was recorded. The ktime_get_snapshot()
logic was added in the first place for this reason.
Some more aggressive approaches try to dump a bunch of the internal
kernel timekeeping state out to userland and call it an api.
See https://lore.kernel.org/lkml/410bbef9771ef8aa51704994a70d5965e367e2ce.camel@infradead.org/
for a recent (and thorough) effort there.
I'm very much not a fan of this approach, as it mimics older efforts
for userspace time calculations that were done before we settled on
VDSOs, which were very fragile and required years of keeping backwards
compatibility logic to map the current kernel state back to separate
structures and expensive conversions to different units that userland
expected.
The benefit with VDSO interface is while the data is exposed to
userland, the structure is not, and the logic is still kernel
controlled, so changes to internal state can be done without breaking
userland.
Something I have been thinking about is maybe it would be beneficial
to rework the timekeeping core so that given a clocksource timestamp,
it could calculate the time for that timestamp. While existing apis
would still do a new read of the clocksource, so the timestamps would
always increase, an old timestamp could be used to retro-calculate a
past time. The thing that prevents this now is that the timekeeping
core doesn't keep any history, so we can't correctly back-calculate
times before the last state change. But potentially we could keep a
buffer of timekeeper states associated with clocksource intervals, and
so we could find the right state to use for a given clocksource
timestamp. Now, this would still only work to a point, as we don't
want to keep tons of historical state. But then with this, maybe we
could switch to something more VDSO-like where the PTP drivers or host
systems could request a time given a timestamp (and probably some
clocksource id so we can sanity check everyone is using the same
clock), and we could still provide what they want without having to
expose all of our state.
Unfortunately though, this is all hand waving and pontificating on my
part, as it would be a large rework. But it seems that something closer to
sharing opaque kernel state along with the logic, with proper
syscall-like APIs to do the calculations, would be a much better
approach than just exporting more kernel state as an API.
For a more short term approach, since you can't be exact outside of
the timekeeping logic, why not interpolate from the data
ktime_get_snapshot already provides to calculate your own sense of the
frequency?
thanks
-john
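To make the interpolation idea concrete, here is a minimal sketch assuming the
hypervisor timestamps are raw counter values from the same clocksource that
ktime_get_snapshot() reads; the names, the fixed shift and the calibration
window are illustrative, and a real implementation would need to refresh the
calibration periodically to limit drift:

        /*
         * Illustrative sketch (not from this series): derive a local
         * cycles-to-boot-ns rate from two snapshots instead of exporting the
         * timekeeper's mult/shift. Uses <linux/timekeeping.h>,
         * <linux/math64.h> and <linux/delay.h>.
         */
        static u64 hyp_clock_mult;
        static unsigned int hyp_clock_shift = 20;
        static u64 hyp_clock_base_cycles;
        static ktime_t hyp_clock_base_boot;

        static void hyp_clock_calibrate(void)
        {
                struct system_time_snapshot a, b;

                ktime_get_snapshot(&a);
                udelay(100);            /* let some counter cycles elapse */
                ktime_get_snapshot(&b);

                /* rate = delta_boot_ns / delta_cycles, as a mult/shift pair */
                hyp_clock_mult = div64_u64((u64)(b.boot - a.boot) << hyp_clock_shift,
                                           b.cycles - a.cycles);
                hyp_clock_base_cycles = b.cycles;
                hyp_clock_base_boot = b.boot;
        }

        /* Convert a counter value into an approximate boot-clock timestamp */
        static ktime_t hyp_clock_to_boot_ns(u64 cycles)
        {
                u64 delta = cycles - hyp_clock_base_cycles;

                return hyp_clock_base_boot +
                       mul_u64_u64_shr(delta, hyp_clock_mult, hyp_clock_shift);
        }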
* Re: [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-22 18:13 ` John Stultz
@ 2024-08-22 21:41 ` Thomas Gleixner
2024-09-05 13:17 ` Vincent Donnefort
1 sibling, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2024-08-22 21:41 UTC (permalink / raw)
To: John Stultz, Vincent Donnefort
Cc: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm,
will, qperret, kernel-team
On Thu, Aug 22 2024 at 11:13, John Stultz wrote:
> On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
> <kernel-team@android.com> wrote:
>> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
>> index fc12a9ba2c88..0fc6a61d64bd 100644
>> --- a/include/linux/timekeeping.h
>> +++ b/include/linux/timekeeping.h
>> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>> * counter value
>> * @cycles: Clocksource counter value to produce the system times
>> * @real: Realtime system time
>> + * @boot: Boot time
>
> So, adding the boottime to this kernel-internal snapshot seems reasonable to me.
Maybe for you, but I have zero context to this as this submission
obviously failed to CC the relevant mailing lists and maintainers...
Documentation/process is there for a reason...
Thanks,
tglx
* Re: [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-22 9:13 ` Marc Zyngier
@ 2024-09-05 13:04 ` Vincent Donnefort
0 siblings, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-09-05 13:04 UTC (permalink / raw)
To: Marc Zyngier
Cc: rostedt, mhiramat, linux-trace-kernel, oliver.upton, kvmarm, will,
qperret, kernel-team
On Thu, Aug 22, 2024 at 10:13:34AM +0100, Marc Zyngier wrote:
> On Mon, 05 Aug 2024 18:32:27 +0100,
> Vincent Donnefort <vdonnefort@google.com> wrote:
> >
> > On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> > This means when running with nVHE or protected KVM, it is easy to
> > generate clock values from the hypervisor, synchronized with the kernel.
>
> When you say "arch_timer" here, are you talking about the data
> structure describing the timer? Or about the actual *counter*, a
> system register provided by the HW?
>
> I'm not sure the architecture-specific details are massively relevant,
> given that this is an arch-agnostic change.
I meant the counter, but I'm happy to drop this entire paragraph and just keep
the following one!
>
> >
> > For tracing purpose, the boot clock is interesting as it doesn't stop on
> > suspend. Export it as part of the time snapshot. This will later allow
> > the hypervisor to add boot clock timestamps to its events.
>
> Isn't that the actual description of the change? By getting the boot
> time as well as the parameters to compute an increment, you allow any
> subsystem able to perform a snapshot to compute a delta from boot time
> as long as they have access to the counter source.
>
[...]
>
> This looks good to me, but you should probably Cc the timekeeping
> maintainers (tglx, John Stultz, and Stephen Boyd).
Yep, my bad!
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* Re: [RFC PATCH 04/11] timekeeping: Export the boot clock in snapshots
2024-08-22 18:13 ` John Stultz
2024-08-22 21:41 ` Thomas Gleixner
@ 2024-09-05 13:17 ` Vincent Donnefort
1 sibling, 0 replies; 22+ messages in thread
From: Vincent Donnefort @ 2024-09-05 13:17 UTC (permalink / raw)
To: John Stultz
Cc: rostedt, mhiramat, linux-trace-kernel, maz, oliver.upton, kvmarm,
will, qperret, kernel-team, Thomas Gleixner
On Thu, Aug 22, 2024 at 11:13:11AM -0700, John Stultz wrote:
[...]
> For a more short term approach, since you can't be exact outside of
> the timekeeping logic, why not interpolate from the data
> ktime_get_snapshot already provides to calculate your own sense of the
> frequency?
Understood, I shouldn't sneak out mult and shift. So for the next version,
I'll just use the boot clock value and compute my own mult and shift.
Thanks for having a look at the change!
>
> thanks
> -john