public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [for-linus][PATCH 0/4] tracing: Fixes for v6.6-rc3
@ 2023-09-30 20:32 Steven Rostedt
  2023-09-30 20:32 ` [for-linus][PATCH 1/4] ring-buffer: Update "shortest_full" in polling Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-09-30 20:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]

Tracing fixes for v6.6-rc3:

- Make sure 32 bit applications using user events have aligned access when
  running on a 64 bit kernel.

- Add cond_resched in the loop that handles converting enums in print_fmt
  string is trace events.

- Fix premature wake ups of polling processes in the tracing ring buffer. When
  a task polls waiting for a percentage of the ring buffer to be filled, the
  writer still will wake it up at every event. Add the polling's percentage to
  the "shortest_full" list to tell the writer when to wake it up.

- For eventfs dir lookups on dynamic events, an event system's only event could
  be removed, leaving its dentry with no children. This is totally legitimate.
  But on eventfs_release() it must not access the children array, as it is only
  allocated when the dentry has children.

  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/urgent

Head SHA1: 2598bd3ca8dcf5bbca1161ee5b271b432398da37


Beau Belgrave (1):
      tracing/user_events: Align set_bit() address for all archs

Clément Léger (1):
      tracing: relax trace_event_eval_update() execution with cond_resched()

Steven Rostedt (Google) (2):
      ring-buffer: Update "shortest_full" in polling
      eventfs: Test for dentries array allocated in eventfs_release()

----
 fs/tracefs/event_inode.c         |  2 +-
 kernel/trace/ring_buffer.c       |  3 +++
 kernel/trace/trace_events.c      |  1 +
 kernel/trace/trace_events_user.c | 58 +++++++++++++++++++++++++++++++++++-----
 4 files changed, 56 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [for-linus][PATCH 1/4] ring-buffer: Update "shortest_full" in polling
  2023-09-30 20:32 [for-linus][PATCH 0/4] tracing: Fixes for v6.6-rc3 Steven Rostedt
@ 2023-09-30 20:32 ` Steven Rostedt
  2023-09-30 20:32 ` [for-linus][PATCH 4/4] eventfs: Test for dentries array allocated in eventfs_release() Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-09-30 20:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton, stable,
	Julia Lawall

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

It was discovered that the ring buffer polling was incorrectly stating
that read would not block, but that's because polling did not take into
account that reads will block if the "buffer-percent" was set. Instead,
the ring buffer polling would say reads would not block if there was any
data in the ring buffer. This was incorrect behavior from a user space
point of view. This was fixed by commit 42fb0a1e84ff by having the polling
code check if the ring buffer had more data than what the user specified
"buffer percent" had.

The problem now is that the polling code did not register itself to the
writer that it wanted to wait for a specific "full" value of the ring
buffer. The result was that the writer would wake the polling waiter
whenever there was a new event. The polling waiter would then wake up, see
that there's not enough data in the ring buffer to notify user space and
then go back to sleep. The next event would wake it up again.

Before the polling fix was added, the code would wake up around 100 times
for a hackbench 30 benchmark. After the "fix", due to the constant waking
of the writer, it would wake up over 11,0000 times! It would never leave
the kernel, so the user space behavior was still "correct", but this
definitely is not the desired effect.

To fix this, have the polling code add what it's waiting for to the
"shortest_full" variable, to tell the writer not to wake it up if the
buffer is not as full as it expects to be.

Note, after this fix, it appears that the waiter is now woken up around 2x
the times it was before (~200). This is a tremendous improvement from the
11,000 times, but I will need to spend some time to see why polling is
more aggressive in its wakeups than the read blocking code.

Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Tested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28daf0ce95c5..515cafdb18d9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1137,6 +1137,9 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
 	if (full) {
 		poll_wait(filp, &work->full_waiters, poll_table);
 		work->full_waiters_pending = true;
+		if (!cpu_buffer->shortest_full ||
+		    cpu_buffer->shortest_full > full)
+			cpu_buffer->shortest_full = full;
 	} else {
 		poll_wait(filp, &work->waiters, poll_table);
 		work->waiters_pending = true;
-- 
2.40.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [for-linus][PATCH 4/4] eventfs: Test for dentries array allocated in eventfs_release()
  2023-09-30 20:32 [for-linus][PATCH 0/4] tracing: Fixes for v6.6-rc3 Steven Rostedt
  2023-09-30 20:32 ` [for-linus][PATCH 1/4] ring-buffer: Update "shortest_full" in polling Steven Rostedt
@ 2023-09-30 20:32 ` Steven Rostedt
       [not found] ` <20230930203821.632661639@goodmis.org>
       [not found] ` <20230930203821.843544914@goodmis.org>
  3 siblings, 0 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-09-30 20:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The dcache_dir_open_wrapper() could be called when a dynamic event is
being deleted leaving a dentry with no children. In this case the
dlist->dentries array will never be allocated. This needs to be checked
for in eventfs_release(), otherwise it will trigger a NULL pointer
dereference.

Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: ef36b4f92868 ("eventfs: Remember what dentries were created on dir open")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5f1714089884..8c8d64e76103 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -421,7 +421,7 @@ static int eventfs_release(struct inode *inode, struct file *file)
 	if (WARN_ON_ONCE(!dlist))
 		return -EINVAL;
 
-	for (i = 0; dlist->dentries[i]; i++) {
+	for (i = 0; dlist->dentries && dlist->dentries[i]; i++) {
 		dput(dlist->dentries[i]);
 	}
 
-- 
2.40.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [for-linus][PATCH 2/4] tracing: relax trace_event_eval_update() execution with cond_resched()
       [not found] ` <20230930203821.632661639@goodmis.org>
@ 2023-09-30 20:41   ` Steven Rostedt
  0 siblings, 0 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-09-30 20:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton,
	Clément Léger, Atish Patra


[ The problem with sending email with quilt. It doesn't handle UTF-8 well :-/ ]

On Sat, 30 Sep 2023 16:32:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: =?UTF-8?q?Cl=C3=A9ment=20L=C3=A9ger?= <cleger@rivosinc.com>
> 
> When kernel is compiled without preemption, the eval_map_work_func()
> (which calls trace_event_eval_update()) will not be preempted up to its
> complete execution. This can actually cause a problem since if another
> CPU call stop_machine(), the call will have to wait for the
> eval_map_work_func() function to finish executing in the workqueue
> before being able to be scheduled. This problem was observe on a SMP
> system at boot time, when the CPU calling the initcalls executed
> clocksource_done_booting() which in the end calls stop_machine(). We
> observed a 1 second delay because one CPU was executing
> eval_map_work_func() and was not preempted by the stop_machine() task.
> 
> Adding a call to cond_resched() in trace_event_eval_update() allows
> other tasks to be executed and thus continue working asynchronously
> like before without blocking any pending task at boot time.
> 
> Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com
> 
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Signed-off-by: Clément Léger <cleger@rivosinc.com>
> Tested-by: Atish Patra <atishp@rivosinc.com>
> Reviewed-by: Atish Patra <atishp@rivosinc.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace_events.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 91951d038ba4..f49d6ddb6342 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2770,6 +2770,7 @@ void trace_event_eval_update(struct trace_eval_map **map, int len)
>  				update_event_fields(call, map[i]);
>  			}
>  		}
> +		cond_resched();
>  	}
>  	up_write(&trace_event_sem);
>  }


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [for-linus][PATCH 3/4] tracing/user_events: Align set_bit() address for all archs
       [not found] ` <20230930203821.843544914@goodmis.org>
@ 2023-09-30 20:41   ` Steven Rostedt
  0 siblings, 0 replies; 5+ messages in thread
From: Steven Rostedt @ 2023-09-30 20:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton, stable,
	Clément Léger, Beau Belgrave


[ Replying to show this patch that got dropped from the mailing list ]

On Sat, 30 Sep 2023 16:32:16 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Beau Belgrave <beaub@linux.microsoft.com>
> 
> All architectures should use a long aligned address passed to set_bit().
> User processes can pass either a 32-bit or 64-bit sized value to be
> updated when tracing is enabled when on a 64-bit kernel. Both cases are
> ensured to be naturally aligned, however, that is not enough. The
> address must be long aligned without affecting checks on the value
> within the user process which require different adjustments for the bit
> for little and big endian CPUs.
> 
> Add a compat flag to user_event_enabler that indicates when a 32-bit
> value is being used on a 64-bit kernel. Long align addresses and correct
> the bit to be used by set_bit() to account for this alignment. Ensure
> compat flags are copied during forks and used during deletion clears.
> 
> Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com
> Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/
> 
> Cc: stable@vger.kernel.org
> Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement")
> Reported-by: Clément Léger <cleger@rivosinc.com>
> Suggested-by: Clément Léger <cleger@rivosinc.com>
> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace_events_user.c | 58 ++++++++++++++++++++++++++++----
>  1 file changed, 51 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 6f046650e527..b87f41187c6a 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -127,8 +127,13 @@ struct user_event_enabler {
>  /* Bit 7 is for freeing status of enablement */
>  #define ENABLE_VAL_FREEING_BIT 7
>  
> -/* Only duplicate the bit value */
> -#define ENABLE_VAL_DUP_MASK ENABLE_VAL_BIT_MASK
> +/* Bit 8 is for marking 32-bit on 64-bit */
> +#define ENABLE_VAL_32_ON_64_BIT 8
> +
> +#define ENABLE_VAL_COMPAT_MASK (1 << ENABLE_VAL_32_ON_64_BIT)
> +
> +/* Only duplicate the bit and compat values */
> +#define ENABLE_VAL_DUP_MASK (ENABLE_VAL_BIT_MASK | ENABLE_VAL_COMPAT_MASK)
>  
>  #define ENABLE_BITOPS(e) (&(e)->values)
>  
> @@ -174,6 +179,30 @@ struct user_event_validator {
>  	int			flags;
>  };
>  
> +static inline void align_addr_bit(unsigned long *addr, int *bit,
> +				  unsigned long *flags)
> +{
> +	if (IS_ALIGNED(*addr, sizeof(long))) {
> +#ifdef __BIG_ENDIAN
> +		/* 32 bit on BE 64 bit requires a 32 bit offset when aligned. */
> +		if (test_bit(ENABLE_VAL_32_ON_64_BIT, flags))
> +			*bit += 32;
> +#endif
> +		return;
> +	}
> +
> +	*addr = ALIGN_DOWN(*addr, sizeof(long));
> +
> +	/*
> +	 * We only support 32 and 64 bit values. The only time we need
> +	 * to align is a 32 bit value on a 64 bit kernel, which on LE
> +	 * is always 32 bits, and on BE requires no change when unaligned.
> +	 */
> +#ifdef __LITTLE_ENDIAN
> +	*bit += 32;
> +#endif
> +}
> +
>  typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter *i,
>  				   void *tpdata, bool *faulted);
>  
> @@ -482,6 +511,7 @@ static int user_event_enabler_write(struct user_event_mm *mm,
>  	unsigned long *ptr;
>  	struct page *page;
>  	void *kaddr;
> +	int bit = ENABLE_BIT(enabler);
>  	int ret;
>  
>  	lockdep_assert_held(&event_mutex);
> @@ -497,6 +527,8 @@ static int user_event_enabler_write(struct user_event_mm *mm,
>  		     test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler))))
>  		return -EBUSY;
>  
> +	align_addr_bit(&uaddr, &bit, ENABLE_BITOPS(enabler));
> +
>  	ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
>  				    &page, NULL);
>  
> @@ -515,9 +547,9 @@ static int user_event_enabler_write(struct user_event_mm *mm,
>  
>  	/* Update bit atomically, user tracers must be atomic as well */
>  	if (enabler->event && enabler->event->status)
> -		set_bit(ENABLE_BIT(enabler), ptr);
> +		set_bit(bit, ptr);
>  	else
> -		clear_bit(ENABLE_BIT(enabler), ptr);
> +		clear_bit(bit, ptr);
>  
>  	kunmap_local(kaddr);
>  	unpin_user_pages_dirty_lock(&page, 1, true);
> @@ -849,6 +881,12 @@ static struct user_event_enabler
>  	enabler->event = user;
>  	enabler->addr = uaddr;
>  	enabler->values = reg->enable_bit;
> +
> +#if BITS_PER_LONG >= 64
> +	if (reg->enable_size == 4)
> +		set_bit(ENABLE_VAL_32_ON_64_BIT, ENABLE_BITOPS(enabler));
> +#endif
> +
>  retry:
>  	/* Prevents state changes from racing with new enablers */
>  	mutex_lock(&event_mutex);
> @@ -2377,7 +2415,8 @@ static long user_unreg_get(struct user_unreg __user *ureg,
>  }
>  
>  static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
> -				   unsigned long uaddr, unsigned char bit)
> +				   unsigned long uaddr, unsigned char bit,
> +				   unsigned long flags)
>  {
>  	struct user_event_enabler enabler;
>  	int result;
> @@ -2385,7 +2424,7 @@ static int user_event_mm_clear_bit(struct user_event_mm *user_mm,
>  
>  	memset(&enabler, 0, sizeof(enabler));
>  	enabler.addr = uaddr;
> -	enabler.values = bit;
> +	enabler.values = bit | flags;
>  retry:
>  	/* Prevents state changes from racing with new enablers */
>  	mutex_lock(&event_mutex);
> @@ -2415,6 +2454,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
>  	struct user_event_mm *mm = current->user_event_mm;
>  	struct user_event_enabler *enabler, *next;
>  	struct user_unreg reg;
> +	unsigned long flags;
>  	long ret;
>  
>  	ret = user_unreg_get(ureg, &reg);
> @@ -2425,6 +2465,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
>  	if (!mm)
>  		return -ENOENT;
>  
> +	flags = 0;
>  	ret = -ENOENT;
>  
>  	/*
> @@ -2441,6 +2482,9 @@ static long user_events_ioctl_unreg(unsigned long uarg)
>  		    ENABLE_BIT(enabler) == reg.disable_bit) {
>  			set_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler));
>  
> +			/* We must keep compat flags for the clear */
> +			flags |= enabler->values & ENABLE_VAL_COMPAT_MASK;
> +
>  			if (!test_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler)))
>  				user_event_enabler_destroy(enabler, true);
>  
> @@ -2454,7 +2498,7 @@ static long user_events_ioctl_unreg(unsigned long uarg)
>  	/* Ensure bit is now cleared for user, regardless of event status */
>  	if (!ret)
>  		ret = user_event_mm_clear_bit(mm, reg.disable_addr,
> -					      reg.disable_bit);
> +					      reg.disable_bit, flags);
>  
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-09-30 20:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-30 20:32 [for-linus][PATCH 0/4] tracing: Fixes for v6.6-rc3 Steven Rostedt
2023-09-30 20:32 ` [for-linus][PATCH 1/4] ring-buffer: Update "shortest_full" in polling Steven Rostedt
2023-09-30 20:32 ` [for-linus][PATCH 4/4] eventfs: Test for dentries array allocated in eventfs_release() Steven Rostedt
     [not found] ` <20230930203821.632661639@goodmis.org>
2023-09-30 20:41   ` [for-linus][PATCH 2/4] tracing: relax trace_event_eval_update() execution with cond_resched() Steven Rostedt
     [not found] ` <20230930203821.843544914@goodmis.org>
2023-09-30 20:41   ` [for-linus][PATCH 3/4] tracing/user_events: Align set_bit() address for all archs Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox