linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH v2] tracing: Fix regression of filter waiting a long time on RCU synchronization
@ 2025-06-07  0:20 Steven Rostedt
  2025-06-09 12:21 ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Rostedt @ 2025-06-07  0:20 UTC (permalink / raw)
  To: LKML, Linux trace kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Paul E. McKenney, Jan Kiszka,
	Andreas Ziegler, Felix MOESSBAUER, Julien Flot

From: Steven Rostedt <rostedt@goodmis.org>

When faultable trace events were added, a trace event may no longer use
normal RCU for synchronization but may instead use
synchronize_rcu_tasks_trace(), which takes a much longer time to complete.

The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With that function now calling
synchronize_rcu_tasks_trace(), the time to free a filter increased
tremendously. On a PREEMPT_RT system, it was even more noticeable.
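
For reference, tracepoint_synchronize_unregister() is now roughly the
following double synchronization (see also the comment added in the
patch below):

 static inline void tracepoint_synchronize_unregister(void)
 {
	/* wait for both a tasks-trace and a normal RCU grace period */
	synchronize_rcu_tasks_trace();
	synchronize_rcu();
 }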

 # time trace-cmd record -p function sleep 1
 [..]
 real	2m29.052s
 user	0m0.244s
 sys	0m20.136s

As trace-cmd would clear out all the filters before recording, it could
take over two minutes to do a recording of "sleep 1".

To find out where the issue was:

 ~# trace-cmd sqlhist -e -n sched_stack  select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta,  start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid

Which will produce the following commands (and -e will also execute them):

 echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
 echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
 echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger

The above creates a synthetic event that records a stack trace when a task
schedules out, along with the time until it schedules back in. Basically,
that is the time the task is off the CPU. It also records the state of the
task when it left the CPU (running, blocked, sleeping, etc.), and saves the
comm of the task as "comm" (needed for the next command).

~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" &&  state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger

The above creates a histogram with buckets per state and per stack, where
the value is the total time the task was off the CPU for that stack trace.
It filters on tasks with comm == "trace-cmd" and only the sleeping and
blocked states (1 - sleeping, 2 - blocked).
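
For reference, those state values correspond to the task state bits
(assuming the usual definitions from include/linux/sched.h):

 #define TASK_INTERRUPTIBLE	0x00000001	/* 1 - sleeping */
 #define TASK_UNINTERRUPTIBLE	0x00000002	/* 2 - blocked  */

so the "state & 3" part of the filter matches either of those two states.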

~# trace-cmd record -p function sleep 1

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_tasks_generic+0x151/0x230
         apply_subsystem_event_filter+0xa2b/0x1300
         subsystem_filter_write+0x67/0xc0
         vfs_write+0x1e2/0xeb0
         ksys_write+0xff/0x1d0
         do_syscall_64+0x7b/0x420
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
} hitcount:        237  delta:   99756288  <<--------------- Delta is 99 seconds!

Totals:
    Hits: 525
    Entries: 21
    Dropped: 0

This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks_trace() in apply_subsystem_event_filter().

In fact, there are many places in the filter code that spend a long time
waiting on synchronize_rcu_tasks_trace() in order to free the filters.
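
The synchronous pattern being replaced looks roughly like this (taken from
the code removed from apply_event_filter() below):

	event_clear_filter(file);

	/* Make sure the filter is not being used */
	tracepoint_synchronize_unregister();
	__free_filter(filter);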

Add helper functions that will use call_rcu*() variants to asynchronously
free the filters. This brings the timings back to normal:

 # time trace-cmd record -p function sleep 1
 [..]
 real	0m14.681s
 user	0m0.335s
 sys	0m28.616s

And the histogram also shows this:

~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
{ state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_normal+0x3db/0x5c0
         tracing_reset_online_cpus+0x8f/0x1e0
         tracing_open+0x335/0x440
         do_dentry_open+0x4c6/0x17a0
         vfs_open+0x82/0x360
         path_openat+0x1a36/0x2990
         do_filp_open+0x1c5/0x420
         do_sys_openat2+0xed/0x180
         __x64_sys_openat+0x108/0x1d0
         do_syscall_64+0x7b/0x420
} hitcount:          2  delta:      77044

Totals:
    Hits: 55
    Entries: 28
    Dropped: 0

Where the longest wait on RCU synchronization is now only 77 milliseconds,
and it is in synchronize_rcu_normal() from opening the trace file, not in
the filter code.

Cc: stable@vger.kernel.org
Reported-by: "Flot, Julien" <julien.flot@siemens.com>
Tested-by: Julien Flot <julien.flot@siemens.com>
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250605161701.35f7989a@gandalf.local.home

- Fix accessing file->filter as it is tagged as __rcu.
  https://lore.kernel.org/all/202506070406.BQGH3VyM-lkp@intel.com/

 kernel/trace/trace_events_filter.c | 186 +++++++++++++++++++++--------
 1 file changed, 138 insertions(+), 48 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2048560264bb..711520081741 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1335,22 +1335,139 @@ static void filter_free_subsystem_preds(struct trace_subsystem_dir *dir,
 	}
 }
 
+struct filter_list {
+	struct list_head	list;
+	struct event_filter	*filter;
+};
+
+struct filter_head {
+	struct list_head	list;
+	struct rcu_head		rcu;
+};
+
+
+static void free_filter_list(struct rcu_head *rhp)
+{
+	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
+	struct filter_list *filter_item, *tmp;
+
+	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
+		__free_filter(filter_item->filter);
+		list_del(&filter_item->list);
+		kfree(filter_item);
+	}
+	kfree(filter_list);
+}
+
+static void free_filter_list_tasks(struct rcu_head *rhp)
+{
+	call_rcu(rhp, free_filter_list);
+}
+
+/*
+ * The tracepoint_synchronize_unregister() is a double rcu call.
+ * It calls synchronize_rcu_tasks_trace() followed by synchronize_rcu().
+ * Instead of waiting for it, simply call these via the call_rcu*()
+ * variants.
+ */
+static void delay_free_filter(struct filter_head *head)
+{
+	call_rcu_tasks_trace(&head->rcu, free_filter_list_tasks);
+}
+
+static void try_delay_free_filter(struct event_filter *filter)
+{
+	struct filter_head *head;
+	struct filter_list *item;
+
+	head = kmalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto free_now;
+
+	INIT_LIST_HEAD(&head->list);
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item) {
+		kfree(head);
+		goto free_now;
+	}
+
+	item->filter = filter;
+	list_add_tail(&item->list, &head->list);
+	delay_free_filter(head);
+	return;
+
+ free_now:
+	/* Make sure the filter is not being used */
+	tracepoint_synchronize_unregister();
+	__free_filter(filter);
+}
+
 static inline void __free_subsystem_filter(struct trace_event_file *file)
 {
 	__free_filter(file->filter);
 	file->filter = NULL;
 }
 
+static inline void event_set_filter(struct trace_event_file *file,
+				    struct event_filter *filter)
+{
+	rcu_assign_pointer(file->filter, filter);
+}
+
+static inline void event_clear_filter(struct trace_event_file *file)
+{
+	RCU_INIT_POINTER(file->filter, NULL);
+}
+
 static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
-					  struct trace_array *tr)
+					  struct trace_array *tr,
+					  struct event_filter *filter)
 {
 	struct trace_event_file *file;
+	struct filter_head *head;
+	struct filter_list *item;
+
+	head = kmalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto free_now;
+
+	INIT_LIST_HEAD(&head->list);
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item) {
+		kfree(head);
+		goto free_now;
+	}
+
+	item->filter = filter;
+	list_add_tail(&item->list, &head->list);
 
 	list_for_each_entry(file, &tr->events, list) {
 		if (file->system != dir)
 			continue;
-		__free_subsystem_filter(file);
+		item = kmalloc(sizeof(*item), GFP_KERNEL);
+		if (!item)
+			goto free_now;
+		item->filter = event_filter(file);
+		list_add_tail(&item->list, &head->list);
+		event_clear_filter(file);
 	}
+
+	delay_free_filter(head);
+	return;
+ free_now:
+	tracepoint_synchronize_unregister();
+
+	if (head)
+		free_filter_list(&head->rcu);
+
+	list_for_each_entry(file, &tr->events, list) {
+		if (file->system != dir || !file->filter)
+			continue;
+		__free_filter(file->filter);
+	}
+	__free_filter(filter);
 }
 
 int filter_assign_type(const char *type)
@@ -2120,22 +2237,6 @@ static inline void event_set_filtered_flag(struct trace_event_file *file)
 		trace_buffered_event_enable();
 }
 
-static inline void event_set_filter(struct trace_event_file *file,
-				    struct event_filter *filter)
-{
-	rcu_assign_pointer(file->filter, filter);
-}
-
-static inline void event_clear_filter(struct trace_event_file *file)
-{
-	RCU_INIT_POINTER(file->filter, NULL);
-}
-
-struct filter_list {
-	struct list_head	list;
-	struct event_filter	*filter;
-};
-
 static int process_system_preds(struct trace_subsystem_dir *dir,
 				struct trace_array *tr,
 				struct filter_parse_error *pe,
@@ -2144,11 +2245,16 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 	struct trace_event_file *file;
 	struct filter_list *filter_item;
 	struct event_filter *filter = NULL;
-	struct filter_list *tmp;
-	LIST_HEAD(filter_list);
+	struct filter_head *filter_list;
 	bool fail = true;
 	int err;
 
+	filter_list = kmalloc(sizeof(*filter_list), GFP_KERNEL);
+	if (!filter_list)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&filter_list->list);
+
 	list_for_each_entry(file, &tr->events, list) {
 
 		if (file->system != dir)
@@ -2175,7 +2281,7 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 		if (!filter_item)
 			goto fail_mem;
 
-		list_add_tail(&filter_item->list, &filter_list);
+		list_add_tail(&filter_item->list, &filter_list->list);
 		/*
 		 * Regardless of if this returned an error, we still
 		 * replace the filter for the call.
@@ -2195,31 +2301,22 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 	 * Do a synchronize_rcu() and to ensure all calls are
 	 * done with them before we free them.
 	 */
-	tracepoint_synchronize_unregister();
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		__free_filter(filter_item->filter);
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+	delay_free_filter(filter_list);
 	return 0;
  fail:
 	/* No call succeeded */
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+	free_filter_list(&filter_list->rcu);
 	parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
 	return -EINVAL;
  fail_mem:
 	__free_filter(filter);
+
 	/* If any call succeeded, we still need to sync */
 	if (!fail)
-		tracepoint_synchronize_unregister();
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		__free_filter(filter_item->filter);
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+		delay_free_filter(filter_list);
+	else
+		free_filter_list(&filter_list->rcu);
+
 	return -ENOMEM;
 }
 
@@ -2361,9 +2458,7 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
 
 		event_clear_filter(file);
 
-		/* Make sure the filter is not being used */
-		tracepoint_synchronize_unregister();
-		__free_filter(filter);
+		try_delay_free_filter(filter);
 
 		return 0;
 	}
@@ -2387,11 +2482,8 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
 
 		event_set_filter(file, filter);
 
-		if (tmp) {
-			/* Make sure the call is done with the filter */
-			tracepoint_synchronize_unregister();
-			__free_filter(tmp);
-		}
+		if (tmp)
+			try_delay_free_filter(tmp);
 	}
 
 	return err;
@@ -2417,9 +2509,7 @@ int apply_subsystem_event_filter(struct trace_subsystem_dir *dir,
 		filter = system->filter;
 		system->filter = NULL;
 		/* Ensure all filters are no longer used */
-		tracepoint_synchronize_unregister();
-		filter_free_subsystem_filters(dir, tr);
-		__free_filter(filter);
+		filter_free_subsystem_filters(dir, tr, filter);
 		return 0;
 	}
 
-- 
2.47.2


* Re: [PATCH v2] tracing: Fix regression of filter waiting a long time on RCU synchronization
  2025-06-07  0:20 [PATCH v2] tracing: Fix regression of filter waiting a long time on RCU synchronization Steven Rostedt
@ 2025-06-09 12:21 ` Paul E. McKenney
  2025-06-09 15:39   ` Steven Rostedt
  0 siblings, 1 reply; 4+ messages in thread
From: Paul E. McKenney @ 2025-06-09 12:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Jan Kiszka, Andreas Ziegler, Felix MOESSBAUER, Julien Flot

On Fri, Jun 06, 2025 at 08:20:20PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When faultable trace events were added, a trace event may no longer use
> normal RCU for synchronization but may instead use
> synchronize_rcu_tasks_trace(), which takes a much longer time to complete.
> 
> The filter logic would free the filters by calling
> tracepoint_synchronize_unregister() after it unhooked the filter strings
> and before freeing them. With that function now calling
> synchronize_rcu_tasks_trace(), the time to free a filter increased
> tremendously. On a PREEMPT_RT system, it was even more noticeable.
> 
>  # time trace-cmd record -p function sleep 1
>  [..]
>  real	2m29.052s
>  user	0m0.244s
>  sys	0m20.136s
> 
> As trace-cmd would clear out all the filters before recording, it could
> take over two minutes to do a recording of "sleep 1".
> 
> To find out where the issue was:
> 
>  ~# trace-cmd sqlhist -e -n sched_stack  select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta,  start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid
> 
> Which will produce the following commands (and -e will also execute them):
> 
>  echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
>  echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
>  echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
> 
> The above creates a synthetic event that records a stack trace when a task
> schedules out, along with the time until it schedules back in. Basically,
> that is the time the task is off the CPU. It also records the state of the
> task when it left the CPU (running, blocked, sleeping, etc.), and saves the
> comm of the task as "comm" (needed for the next command).
> 
> ~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" &&  state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger
> 
> The above creates a histogram with buckets per state and per stack, where
> the value is the total time the task was off the CPU for that stack trace.
> It filters on tasks with comm == "trace-cmd" and only the sleeping and
> blocked states (1 - sleeping, 2 - blocked).
> 
> ~# trace-cmd record -p function sleep 1
> 
> ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
> { state:          2, stack.stacktrace         __schedule+0x1545/0x3700
>          schedule+0xe2/0x390
>          schedule_timeout+0x175/0x200
>          wait_for_completion_state+0x294/0x440
>          __wait_rcu_gp+0x247/0x4f0
>          synchronize_rcu_tasks_generic+0x151/0x230
>          apply_subsystem_event_filter+0xa2b/0x1300
>          subsystem_filter_write+0x67/0xc0
>          vfs_write+0x1e2/0xeb0
>          ksys_write+0xff/0x1d0
>          do_syscall_64+0x7b/0x420
>          entry_SYSCALL_64_after_hwframe+0x76/0x7e
> } hitcount:        237  delta:   99756288  <<--------------- Delta is 99 seconds!
> 
> Totals:
>     Hits: 525
>     Entries: 21
>     Dropped: 0
> 
> This shows that this particular trace waited for 99 seconds on
> synchronize_rcu_tasks_trace() in apply_subsystem_event_filter().
> 
> In fact, there are many places in the filter code that spend a long time
> waiting on synchronize_rcu_tasks_trace() in order to free the filters.
> 
> Add helper functions that will use call_rcu*() variants to asynchronously
> free the filters. This brings the timings back to normal:
> 
>  # time trace-cmd record -p function sleep 1
>  [..]
>  real	0m14.681s
>  user	0m0.335s
>  sys	0m28.616s
> 
> And the histogram also shows this:
> 
> ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
> { state:          2, stack.stacktrace         __schedule+0x1545/0x3700
>          schedule+0xe2/0x390
>          schedule_timeout+0x175/0x200
>          wait_for_completion_state+0x294/0x440
>          __wait_rcu_gp+0x247/0x4f0
>          synchronize_rcu_normal+0x3db/0x5c0
>          tracing_reset_online_cpus+0x8f/0x1e0
>          tracing_open+0x335/0x440
>          do_dentry_open+0x4c6/0x17a0
>          vfs_open+0x82/0x360
>          path_openat+0x1a36/0x2990
>          do_filp_open+0x1c5/0x420
>          do_sys_openat2+0xed/0x180
>          __x64_sys_openat+0x108/0x1d0
>          do_syscall_64+0x7b/0x420
> } hitcount:          2  delta:      77044
> 
> Totals:
>     Hits: 55
>     Entries: 28
>     Dropped: 0
> 
> Where the longest wait on RCU synchronization is now only 77 milliseconds,
> and it is in synchronize_rcu_normal() from opening the trace file, not in
> the filter code.

Nice!!!

One concern below.

> Cc: stable@vger.kernel.org
> Reported-by: "Flot, Julien" <julien.flot@siemens.com>
> Tested-by: Julien Flot <julien.flot@siemens.com>
> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
> Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Changes since v1: https://lore.kernel.org/20250605161701.35f7989a@gandalf.local.home
> 
> - Fix accessing file->filter as it is tagged as __rcu.
>   https://lore.kernel.org/all/202506070406.BQGH3VyM-lkp@intel.com/
> 
>  kernel/trace/trace_events_filter.c | 186 +++++++++++++++++++++--------
>  1 file changed, 138 insertions(+), 48 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 2048560264bb..711520081741 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -1335,22 +1335,139 @@ static void filter_free_subsystem_preds(struct trace_subsystem_dir *dir,
>  	}
>  }
>  
> +struct filter_list {
> +	struct list_head	list;
> +	struct event_filter	*filter;
> +};
> +
> +struct filter_head {
> +	struct list_head	list;
> +	struct rcu_head		rcu;
> +};
> +
> +
> +static void free_filter_list(struct rcu_head *rhp)
> +{
> +	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
> +	struct filter_list *filter_item, *tmp;
> +
> +	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
> +		__free_filter(filter_item->filter);
> +		list_del(&filter_item->list);
> +		kfree(filter_item);
> +	}

If the series of synchronize_rcu() calls was taking 99 seconds,
we might well be going through the above loop quite a few times, all
within a bh-disabled region of code.  For example, the calls from both
filter_free_subsystem_filters() and process_system_preds() look like
they can queue a very large number of entries.

So...

> +	kfree(filter_list);
> +}
> +
> +static void free_filter_list_tasks(struct rcu_head *rhp)
> +{
> +	call_rcu(rhp, free_filter_list);

...why not use queue_rcu_work() in place of call_rcu() here, thus avoiding
possible response-time issues?

Or is there something that I am missing that limits the length of the
list passed to free_filter_list()?

							Thanx, Paul

> +}
> +
> +/*
> + * The tracepoint_synchronize_unregister() is a double rcu call.
> + * It calls synchronize_rcu_tasks_trace() followed by synchronize_rcu().
> + * Instead of waiting for it, simply call these via the call_rcu*()
> + * variants.
> + */
> +static void delay_free_filter(struct filter_head *head)
> +{
> +	call_rcu_tasks_trace(&head->rcu, free_filter_list_tasks);
> +}
> +
> +static void try_delay_free_filter(struct event_filter *filter)
> +{
> +	struct filter_head *head;
> +	struct filter_list *item;
> +
> +	head = kmalloc(sizeof(*head), GFP_KERNEL);
> +	if (!head)
> +		goto free_now;
> +
> +	INIT_LIST_HEAD(&head->list);
> +
> +	item = kmalloc(sizeof(*item), GFP_KERNEL);
> +	if (!item) {
> +		kfree(head);
> +		goto free_now;
> +	}
> +
> +	item->filter = filter;
> +	list_add_tail(&item->list, &head->list);
> +	delay_free_filter(head);
> +	return;
> +
> + free_now:
> +	/* Make sure the filter is not being used */
> +	tracepoint_synchronize_unregister();
> +	__free_filter(filter);
> +}
> +
>  static inline void __free_subsystem_filter(struct trace_event_file *file)
>  {
>  	__free_filter(file->filter);
>  	file->filter = NULL;
>  }
>  
> +static inline void event_set_filter(struct trace_event_file *file,
> +				    struct event_filter *filter)
> +{
> +	rcu_assign_pointer(file->filter, filter);
> +}
> +
> +static inline void event_clear_filter(struct trace_event_file *file)
> +{
> +	RCU_INIT_POINTER(file->filter, NULL);
> +}
> +
>  static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
> -					  struct trace_array *tr)
> +					  struct trace_array *tr,
> +					  struct event_filter *filter)
>  {
>  	struct trace_event_file *file;
> +	struct filter_head *head;
> +	struct filter_list *item;
> +
> +	head = kmalloc(sizeof(*head), GFP_KERNEL);
> +	if (!head)
> +		goto free_now;
> +
> +	INIT_LIST_HEAD(&head->list);
> +
> +	item = kmalloc(sizeof(*item), GFP_KERNEL);
> +	if (!item) {
> +		kfree(head);
> +		goto free_now;
> +	}
> +
> +	item->filter = filter;
> +	list_add_tail(&item->list, &head->list);
>  
>  	list_for_each_entry(file, &tr->events, list) {
>  		if (file->system != dir)
>  			continue;
> -		__free_subsystem_filter(file);
> +		item = kmalloc(sizeof(*item), GFP_KERNEL);
> +		if (!item)
> +			goto free_now;
> +		item->filter = event_filter(file);
> +		list_add_tail(&item->list, &head->list);
> +		event_clear_filter(file);
>  	}
> +
> +	delay_free_filter(head);
> +	return;
> + free_now:
> +	tracepoint_synchronize_unregister();
> +
> +	if (head)
> +		free_filter_list(&head->rcu);
> +
> +	list_for_each_entry(file, &tr->events, list) {
> +		if (file->system != dir || !file->filter)
> +			continue;
> +		__free_filter(file->filter);
> +	}
> +	__free_filter(filter);
>  }
>  
>  int filter_assign_type(const char *type)
> @@ -2120,22 +2237,6 @@ static inline void event_set_filtered_flag(struct trace_event_file *file)
>  		trace_buffered_event_enable();
>  }
>  
> -static inline void event_set_filter(struct trace_event_file *file,
> -				    struct event_filter *filter)
> -{
> -	rcu_assign_pointer(file->filter, filter);
> -}
> -
> -static inline void event_clear_filter(struct trace_event_file *file)
> -{
> -	RCU_INIT_POINTER(file->filter, NULL);
> -}
> -
> -struct filter_list {
> -	struct list_head	list;
> -	struct event_filter	*filter;
> -};
> -
>  static int process_system_preds(struct trace_subsystem_dir *dir,
>  				struct trace_array *tr,
>  				struct filter_parse_error *pe,
> @@ -2144,11 +2245,16 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
>  	struct trace_event_file *file;
>  	struct filter_list *filter_item;
>  	struct event_filter *filter = NULL;
> -	struct filter_list *tmp;
> -	LIST_HEAD(filter_list);
> +	struct filter_head *filter_list;
>  	bool fail = true;
>  	int err;
>  
> +	filter_list = kmalloc(sizeof(*filter_list), GFP_KERNEL);
> +	if (!filter_list)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&filter_list->list);
> +
>  	list_for_each_entry(file, &tr->events, list) {
>  
>  		if (file->system != dir)
> @@ -2175,7 +2281,7 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
>  		if (!filter_item)
>  			goto fail_mem;
>  
> -		list_add_tail(&filter_item->list, &filter_list);
> +		list_add_tail(&filter_item->list, &filter_list->list);
>  		/*
>  		 * Regardless of if this returned an error, we still
>  		 * replace the filter for the call.
> @@ -2195,31 +2301,22 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
>  	 * Do a synchronize_rcu() and to ensure all calls are
>  	 * done with them before we free them.
>  	 */
> -	tracepoint_synchronize_unregister();
> -	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
> -		__free_filter(filter_item->filter);
> -		list_del(&filter_item->list);
> -		kfree(filter_item);
> -	}
> +	delay_free_filter(filter_list);
>  	return 0;
>   fail:
>  	/* No call succeeded */
> -	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
> -		list_del(&filter_item->list);
> -		kfree(filter_item);
> -	}
> +	free_filter_list(&filter_list->rcu);
>  	parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
>  	return -EINVAL;
>   fail_mem:
>  	__free_filter(filter);
> +
>  	/* If any call succeeded, we still need to sync */
>  	if (!fail)
> -		tracepoint_synchronize_unregister();
> -	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
> -		__free_filter(filter_item->filter);
> -		list_del(&filter_item->list);
> -		kfree(filter_item);
> -	}
> +		delay_free_filter(filter_list);
> +	else
> +		free_filter_list(&filter_list->rcu);
> +
>  	return -ENOMEM;
>  }
>  
> @@ -2361,9 +2458,7 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
>  
>  		event_clear_filter(file);
>  
> -		/* Make sure the filter is not being used */
> -		tracepoint_synchronize_unregister();
> -		__free_filter(filter);
> +		try_delay_free_filter(filter);
>  
>  		return 0;
>  	}
> @@ -2387,11 +2482,8 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
>  
>  		event_set_filter(file, filter);
>  
> -		if (tmp) {
> -			/* Make sure the call is done with the filter */
> -			tracepoint_synchronize_unregister();
> -			__free_filter(tmp);
> -		}
> +		if (tmp)
> +			try_delay_free_filter(tmp);
>  	}
>  
>  	return err;
> @@ -2417,9 +2509,7 @@ int apply_subsystem_event_filter(struct trace_subsystem_dir *dir,
>  		filter = system->filter;
>  		system->filter = NULL;
>  		/* Ensure all filters are no longer used */
> -		tracepoint_synchronize_unregister();
> -		filter_free_subsystem_filters(dir, tr);
> -		__free_filter(filter);
> +		filter_free_subsystem_filters(dir, tr, filter);
>  		return 0;
>  	}
>  
> -- 
> 2.47.2
> 

* Re: [PATCH v2] tracing: Fix regression of filter waiting a long time on RCU synchronization
  2025-06-09 12:21 ` Paul E. McKenney
@ 2025-06-09 15:39   ` Steven Rostedt
  2025-06-09 16:09     ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Steven Rostedt @ 2025-06-09 15:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Jan Kiszka, Andreas Ziegler, Felix MOESSBAUER, Julien Flot

On Mon, 9 Jun 2025 05:21:19 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > +static void free_filter_list(struct rcu_head *rhp)
> > +{
> > +	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
> > +	struct filter_list *filter_item, *tmp;
> > +
> > +	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
> > +		__free_filter(filter_item->filter);
> > +		list_del(&filter_item->list);
> > +		kfree(filter_item);
> > +	}  
> 
> If the series of synchronize_rcu() calls was taking 99 seconds,
> we might well be going through the above loop quite a few times, all
> within a bh-disabled region of code.  For example, the calls from both
> filter_free_subsystem_filters() and process_system_preds() look like
> they can queue a very large number of entries.

The iteration is for each event in a subsystem.

Seeing what that is:

~# for sys in /sys/kernel/tracing/events/* ; do cnt=`ls -F $sys |grep '/$' | wc -l`; echo $cnt $sys; done | sort -n | tail
24 /sys/kernel/tracing/events/tcp
26 /sys/kernel/tracing/events/xen
28 /sys/kernel/tracing/events/sched
32 /sys/kernel/tracing/events/writeback
33 /sys/kernel/tracing/events/libata
34 /sys/kernel/tracing/events/irq_vectors
55 /sys/kernel/tracing/events/xhci-hcd
92 /sys/kernel/tracing/events/kvm
113 /sys/kernel/tracing/events/ext4
724 /sys/kernel/tracing/events/syscalls

So the most loops that can happen is 724 on the syscalls plus the system
filter. And adding in printks to count the loops, I verified that:

[..]
[  100.019209] LOOPED 4 times
[  100.020033] LOOPED 5 times
[  100.020791] LOOPED 8 times
[  100.021583] LOOPED 10 times
[  100.022239] LOOPED 2 times
[  100.110515] LOOPED 725 times
[  100.111520] LOOPED 4 times
[  100.112889] LOOPED 25 times
[  100.114416] LOOPED 6 times
[  100.115950] LOOPED 4 times
[  100.117557] LOOPED 7 times
[..]

(724 event filters and one system event filter = 725 filters).

If you feel that's too big, we could move it to a work queue.

As more system calls get added, that number will grow. But it is limited to
the number of system calls.

All the event systems are way fewer than that.

> 
> So...
> 
> > +	kfree(filter_list);
> > +}
> > +
> > +static void free_filter_list_tasks(struct rcu_head *rhp)
> > +{
> > +	call_rcu(rhp, free_filter_list);  
> 
> ...why not use queue_rcu_work() in place of call_rcu() here, thus avoiding
> possible response-time issues?
> 
> Or is there something that I am missing that limits the length of the
> list passed to free_filter_list()?

Just the number of events in a system.

I could switch to a work queue but I don't think it needs to go as a fix,
do you? That is, it can wait till the next merge window. Or is this going
to affect things now?
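
Something like the below (untested) is roughly what I would try, using
queue_rcu_work() for the second stage so that the actual freeing runs in
process context instead of a bh-disabled RCU callback:

	struct filter_head {
		struct list_head	list;
		struct rcu_head		rcu;
		struct rcu_work		rwork;
	};

	static void free_filter_list_work(struct work_struct *work)
	{
		struct filter_head *filter_list;

		filter_list = container_of(to_rcu_work(work),
					   struct filter_head, rwork);
		free_filter_list(&filter_list->rcu);
	}

	static void free_filter_list_tasks(struct rcu_head *rhp)
	{
		struct filter_head *filter_list =
			container_of(rhp, struct filter_head, rcu);

		INIT_RCU_WORK(&filter_list->rwork, free_filter_list_work);
		queue_rcu_work(system_wq, &filter_list->rwork);
	}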

-- Steve

* Re: [PATCH v2] tracing: Fix regression of filter waiting a long time on RCU synchronization
  2025-06-09 15:39   ` Steven Rostedt
@ 2025-06-09 16:09     ` Paul E. McKenney
  0 siblings, 0 replies; 4+ messages in thread
From: Paul E. McKenney @ 2025-06-09 16:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Jan Kiszka, Andreas Ziegler, Felix MOESSBAUER, Julien Flot

On Mon, Jun 09, 2025 at 11:39:55AM -0400, Steven Rostedt wrote:
> On Mon, 9 Jun 2025 05:21:19 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > +static void free_filter_list(struct rcu_head *rhp)
> > > +{
> > > +	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
> > > +	struct filter_list *filter_item, *tmp;
> > > +
> > > +	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
> > > +		__free_filter(filter_item->filter);
> > > +		list_del(&filter_item->list);
> > > +		kfree(filter_item);
> > > +	}  
> > 
> > If the series of synchronize_rcu() calls was taking 99 seconds,
> > we might well be going through the above loop quite a few times, all
> > within a bh-disabled region of code.  For example, the calls from both
> > filter_free_subsystem_filters() and process_system_preds() look like
> > they can queue a very large number of entries.
> 
> The iteration is for each event in a subsystem.
> 
> Seeing what that is:
> 
> ~# for sys in /sys/kernel/tracing/events/* ; do cnt=`ls -F $sys |grep '/$' | wc -l`; echo $cnt $sys; done | sort -n | tail
> 24 /sys/kernel/tracing/events/tcp
> 26 /sys/kernel/tracing/events/xen
> 28 /sys/kernel/tracing/events/sched
> 32 /sys/kernel/tracing/events/writeback
> 33 /sys/kernel/tracing/events/libata
> 34 /sys/kernel/tracing/events/irq_vectors
> 55 /sys/kernel/tracing/events/xhci-hcd
> 92 /sys/kernel/tracing/events/kvm
> 113 /sys/kernel/tracing/events/ext4
> 724 /sys/kernel/tracing/events/syscalls
> 
> So the most loops that can happen is 724 on the syscalls plus the system
> filter. And adding in printks to count the loops, I verified that:
> 
> [..]
> [  100.019209] LOOPED 4 times
> [  100.020033] LOOPED 5 times
> [  100.020791] LOOPED 8 times
> [  100.021583] LOOPED 10 times
> [  100.022239] LOOPED 2 times
> [  100.110515] LOOPED 725 times
> [  100.111520] LOOPED 4 times
> [  100.112889] LOOPED 25 times
> [  100.114416] LOOPED 6 times
> [  100.115950] LOOPED 4 times
> [  100.117557] LOOPED 7 times
> [..]
> 
> (724 event filters and one system event filter = 725 filters).
> 
> If you feel that's too big, we could move it to a work queue.

That does sound like a bit more than would be good.

> As more system calls get added, that number will grow. But it is limited to
> the number of system calls.
> 
> All the event systems are way fewer than that.
> 
> > 
> > So...
> > 
> > > +	kfree(filter_list);
> > > +}
> > > +
> > > +static void free_filter_list_tasks(struct rcu_head *rhp)
> > > +{
> > > +	call_rcu(rhp, free_filter_list);  
> > 
> > ...why not use queue_rcu_work() in place of call_rcu() here, thus avoiding
> > possible response-time issues?
> > 
> > Or is there something that I am missing that limits the length of the
> > list passed to free_filter_list()?
> 
> Just the number of events in a system.
> 
> I could switch to a work queue but I don't think it needs to go as a fix,
> do you? That is, it can wait till the next merge window. Or is this going
> to affect things now?

Thank you, and the next merge window should be fine.

							Thanx, Paul
