linux-kernel.vger.kernel.org archive mirror
* [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU
@ 2025-03-08 16:48 Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
                   ` (17 more replies)
  0 siblings, 18 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

This is a follow up on V2 of this work, which can be found here:

     https://lore.kernel.org/all/20250224095736.145530367@linutronix.de

It addresses the scalability problem of the POSIX timer hash and provides a
performant mechanism to restore POSIX timers with a given ID, along with a
couple of preparatory cleanups and enhancements. More details about
implementation choices are in the change logs and the cover letter of V1:

     https://lore.kernel.org/all/20250302185753.311903554@linutronix.de

Changes vs. V2:

  - Ensure consistency on timer_create() (new patch) - Frederic
  - Pick up the lock_timer() conditional unlock fix (was V2a)
  - Use proper defines in selftests
  - Pick up review/ack tags

The series survives all posix timer tests and did not show any regressions
so far.

The series is based on:

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip timers/core

and is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/posix

Thanks,

	tglx
---
Eric Dumazet (3):
      posix-timers: Initialise timer before adding it to the hash table
      posix-timers: Add cond_resched() to posix_timer_add() search loop
      posix-timers: Make signal_struct::next_posix_timer_id an atomic_t

Peter Zijlstra (1):
      posix-timers: Make lock_timer() use guard()

Thomas Gleixner (14):
      posix-timers: Ensure that timer initialization is fully visible
      posix-timers: Cleanup includes
      posix-timers: Remove a few paranoid warnings
      posix-timers: Remove SLAB_PANIC from kmem cache
      posix-timers: Use guards in a few places
      posix-timers: Simplify lock/unlock_timer()
      posix-timers: Rework timer removal
      posix-timers: Improve hash table performance
      posix-timers: Switch to jhash32()
      posix-timers: Avoid false cacheline sharing
      posix-timers: Make per process list RCU safe
      posix-timers: Dont iterate /proc/$PID/timers with sighand::siglock held
      posix-timers: Provide a mechanism to allocate a given timer ID
      selftests/timers/posix-timers: Add a test for exact allocation mode


 fs/proc/base.c                                |   48 --
 include/linux/cleanup.h                       |   22 -
 include/linux/posix-timers.h                  |   30 +
 include/linux/sched/signal.h                  |    3 
 include/uapi/linux/prctl.h                    |   10 
 kernel/signal.c                               |    2 
 kernel/sys.c                                  |    5 
 kernel/time/posix-timers.c                    |  540 +++++++++++++-------------
 tools/testing/selftests/timers/posix_timers.c |   66 +++
 9 files changed, 418 insertions(+), 308 deletions(-)


* [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-08 21:39   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table Thomas Gleixner
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Frederic pointed out that the memory operations to initialize the timer are
not guaranteed to be visible, when __lock_timer() observes timer::it_signal
valid under timer::it_lock:

  T0                                      T1
  ---------                               -----------
  do_timer_create()
      // A
      new_timer->.... = ....
      spin_lock(current->sighand)
      // B
      WRITE_ONCE(new_timer->it_signal, current->signal)
      spin_unlock(current->sighand)
					sys_timer_*()
					   t =  __lock_timer()
						  spin_lock(&timr->it_lock)
						  // observes B
						  if (timr->it_signal == current->signal)
						    return timr;
			                   if (!t)
					       return;
					// Is not guaranteed to observe A

Protect the write of timer::it_signal, which makes the timer valid, with
timer::it_lock as well. This guarantees that T1 must observe the
initialization A completely, when it observes the valid signal pointer
under timer::it_lock. sighand::siglock must still be taken to protect the
signal::posix_timers list.
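
The ordering argument can be tried out in user space. The following is a
minimal sketch, assuming a pthread mutex as a stand-in for timer::it_lock;
struct demo_timer and both helpers are hypothetical names and not part of
the patch. The payload is written before the lock is taken, so a lookup
which acquires the same lock later and observes the owner also observes
the payload:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user space stand-in for struct k_itimer */
struct demo_timer {
	pthread_mutex_t	lock;		/* plays the role of timer::it_lock */
	int		payload;	/* initialization "A" */
	const void	*owner;		/* validity marker "B", like timer::it_signal */
};

/* Creator side: initialize first, then publish the owner under the lock */
static void demo_publish(struct demo_timer *t, const void *owner, int payload)
{
	assert(owner != NULL);		/* NULL means "not valid" in this sketch */
	t->payload = payload;		/* A */
	pthread_mutex_lock(&t->lock);
	t->owner = owner;		/* B */
	pthread_mutex_unlock(&t->lock);
}

/*
 * Lookup side: returns 1 with the lock held when the timer is valid for
 * @owner, otherwise drops the lock and returns 0.
 */
static int demo_lock_if_valid(struct demo_timer *t, const void *owner)
{
	pthread_mutex_lock(&t->lock);
	if (t->owner == owner)
		return 1;
	pthread_mutex_unlock(&t->lock);
	return 0;
}
```

The unlock in demo_publish() and the later lock in demo_lock_if_valid()
form the release/acquire pair which the original code lacked, because
creator and lookup used two different locks.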

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -462,14 +462,21 @@ static int do_timer_create(clockid_t whi
 	if (error)
 		goto out;
 
-	spin_lock_irq(&current->sighand->siglock);
-	/* This makes the timer valid in the hash table */
-	WRITE_ONCE(new_timer->it_signal, current->signal);
-	hlist_add_head(&new_timer->list, &current->signal->posix_timers);
-	spin_unlock_irq(&current->sighand->siglock);
 	/*
-	 * After unlocking sighand::siglock @new_timer is subject to
-	 * concurrent removal and cannot be touched anymore
+	 * timer::it_lock ensures that __lock_timer() observes a fully
+	 * initialized timer when it observes a valid timer::it_signal.
+	 *
+	 * sighand::siglock is required to protect signal::posix_timers.
+	 */
+	scoped_guard (spinlock_irq, &new_timer->it_lock) {
+		guard(spinlock)(&current->sighand->siglock);
+		/* This makes the timer valid in the hash table */
+		WRITE_ONCE(new_timer->it_signal, current->signal);
+		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
+	}
+	/*
+	 * After unlocking @new_timer is subject to concurrent removal and
+	 * cannot be touched anymore
 	 */
 	return 0;
 out:



* [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-11 13:25   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
  2025-03-08 16:48 ` [patch V3 03/18] posix-timers: Add cond_resched() to posix_timer_add() search loop Thomas Gleixner
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

From: Eric Dumazet <edumazet@google.com>

A timer is only valid in the hashtable when both timer::it_signal and
timer::it_id are set to their final values, but timers are added without
those values being set.

The timer ID is allocated when the timer is added to the hash in invalid
state. The ID is taken from a monotonically increasing per process counter
which wraps around after reaching INT_MAX. The hash insertion validates
that there is no timer with the allocated ID in the hash table which
belongs to the same process. That opens a mostly theoretical race condition:

If other threads of the same process manage to create/delete timers in
rapid succession before the newly created timer is fully initialized and
wrap around to the timer ID which was handed out, then a duplicate timer ID
will be inserted into the hash table.
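
The wrap around of the ID counter can be sketched in isolation. This is an
illustration only, assuming a plain unsigned counter; struct demo_ids and
demo_next_id() are hypothetical names. The masking expression is the one
used in posix_timer_add():

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the per process ID counter */
struct demo_ids {
	unsigned int next_id;
};

/* Hand out the current ID and advance the counter, wrapping after INT_MAX */
static int demo_next_id(struct demo_ids *ids)
{
	unsigned int id = ids->next_id;

	assert(id <= (unsigned int)INT_MAX);
	ids->next_id = (id + 1) & INT_MAX;
	return (int)id;
}
```

Because the counter wraps, an ID which was handed out earlier comes up
again after at most INT_MAX + 1 allocations, which is why the hash
insertion has to check for an existing timer with the same ID.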

Prevent this by:

  1) Setting timer::it_id before inserting the timer into the hashtable.
 
  2) Storing the signal pointer in timer::it_signal with bit 0 set before
     inserting it into the hashtable.

     Bit 0 acts as an invalid bit, which means that the regular lookup for
     sys_timer_*() will fail the comparison with the signal pointer.

     But the lookup on insertion masks out bit 0 and can therefore detect a
     timer which is not yet valid, but allocated in the hash table.  Bit 0
     in the pointer is cleared once the initialization of the timer has
     completed.
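
The two lookup flavours can be sketched in user space. This is a minimal
illustration and not the kernel code: struct demo_sig and all demo_*()
helpers are hypothetical, and the stored value is kept as uintptr_t
instead of a pointer to keep the sketch portable:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct signal_struct */
struct demo_sig {
	int unused;
};

/* Value stored while the timer is allocated, but not yet valid */
static uintptr_t demo_store_pending(const struct demo_sig *sig)
{
	/* Bit 0 is only available when the object is at least 2-byte aligned */
	assert(((uintptr_t)sig & 1) == 0);
	return (uintptr_t)sig | 1;
}

/* Value stored once initialization has completed */
static uintptr_t demo_store_valid(const struct demo_sig *sig)
{
	return (uintptr_t)sig;
}

/* Syscall style lookup: exact comparison, fails while bit 0 is set */
static int demo_syscall_match(uintptr_t stored, const struct demo_sig *sig)
{
	return stored == (uintptr_t)sig;
}

/* Insertion style lookup: masks bit 0, so pending timers are found too */
static int demo_insert_match(uintptr_t stored, const struct demo_sig *sig)
{
	return (stored & ~(uintptr_t)1) == (uintptr_t)sig;
}
```

A pending entry is therefore invisible to demo_syscall_match(), but still
blocks a duplicate ID in demo_insert_match().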

[ tglx: Fold ID and signal initialization into one patch and massage change
  	log and comments. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com
---
 kernel/time/posix-timers.c |   56 +++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 14 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -72,13 +72,13 @@ static int hash(struct signal_struct *si
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
 }
 
-static struct k_itimer *__posix_timers_find(struct hlist_head *head,
-					    struct signal_struct *sig,
-					    timer_t id)
+static struct k_itimer *posix_timer_by_id(timer_t id)
 {
+	struct signal_struct *sig = current->signal;
+	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+	hlist_for_each_entry_rcu(timer, head, t_hash) {
 		/* timer->it_signal can be set concurrently */
 		if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
 			return timer;
@@ -86,12 +86,26 @@ static struct k_itimer *__posix_timers_f
 	return NULL;
 }
 
-static struct k_itimer *posix_timer_by_id(timer_t id)
+static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
 {
-	struct signal_struct *sig = current->signal;
-	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+	unsigned long val = (unsigned long)timer->it_signal;
+
+	/*
+	 * Mask out bit 0, which acts as invalid marker to prevent
+	 * posix_timer_by_id() detecting it as valid.
+	 */
+	return (struct signal_struct *)(val & ~1UL);
+}
+
+static bool posix_timer_hashed(struct hlist_head *head, struct signal_struct *sig, timer_t id)
+{
+	struct k_itimer *timer;
 
-	return __posix_timers_find(head, sig, id);
+	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+		if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
+			return true;
+	}
+	return false;
 }
 
 static int posix_timer_add(struct k_itimer *timer)
@@ -112,7 +126,19 @@ static int posix_timer_add(struct k_itim
 		sig->next_posix_timer_id = (id + 1) & INT_MAX;
 
 		head = &posix_timers_hashtable[hash(sig, id)];
-		if (!__posix_timers_find(head, sig, id)) {
+		if (!posix_timer_hashed(head, sig, id)) {
+			/*
+			 * Set the timer ID and the signal pointer to make
+			 * it identifiable in the hash table. The signal
+			 * pointer has bit 0 set to indicate that it is not
+			 * yet fully initialized. posix_timer_hashed()
+			 * masks this bit out, but the syscall lookup fails
+			 * to match due to it being set. This guarantees
+			 * that there can't be duplicate timer IDs handed
+			 * out.
+			 */
+			timer->it_id = (timer_t)id;
+			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
 			hlist_add_head_rcu(&timer->t_hash, head);
 			spin_unlock(&hash_lock);
 			return id;
@@ -406,8 +432,7 @@ static int do_timer_create(clockid_t whi
 
 	/*
 	 * Add the timer to the hash table. The timer is not yet valid
-	 * because new_timer::it_signal is still NULL. The timer id is also
-	 * not yet visible to user space.
+	 * after insertion, but has a unique ID allocated.
 	 */
 	new_timer_id = posix_timer_add(new_timer);
 	if (new_timer_id < 0) {
@@ -415,7 +440,6 @@ static int do_timer_create(clockid_t whi
 		return new_timer_id;
 	}
 
-	new_timer->it_id = (timer_t) new_timer_id;
 	new_timer->it_clock = which_clock;
 	new_timer->kclock = kc;
 	new_timer->it_overrun = -1LL;
@@ -453,7 +477,7 @@ static int do_timer_create(clockid_t whi
 	}
 	/*
 	 * After succesful copy out, the timer ID is visible to user space
-	 * now but not yet valid because new_timer::signal is still NULL.
+	 * now but not yet valid because new_timer::signal low order bit is 1.
 	 *
 	 * Complete the initialization with the clock specific create
 	 * callback.
@@ -470,7 +494,11 @@ static int do_timer_create(clockid_t whi
 	 */
 	scoped_guard (spinlock_irq, &new_timer->it_lock) {
 		guard(spinlock)(&current->sighand->siglock);
-		/* This makes the timer valid in the hash table */
+		/*
+		 * new_timer::it_signal contains the signal pointer with
+		 * bit 0 set, which makes it invalid for syscall operations.
+		 * Store the unmodified signal pointer to make it valid.
+		 */
 		WRITE_ONCE(new_timer->it_signal, current->signal);
 		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
 	}



* [patch V3 03/18] posix-timers: Add cond_resched() to posix_timer_add() search loop
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
  2025-03-08 16:48 ` [patch V3 04/18] posix-timers: Cleanup includes Thomas Gleixner
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

From: Eric Dumazet <edumazet@google.com>

With a large number of POSIX timers the search for a valid ID might cause a
soft lockup on PREEMPT_NONE/VOLUNTARY kernels.

Add cond_resched() to the loop to prevent that.

[ tglx: Split out from Eric's series ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com

---
 kernel/time/posix-timers.c |    1 +
 1 file changed, 1 insertion(+)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -144,6 +144,7 @@ static int posix_timer_add(struct k_itim
 			return id;
 		}
 		spin_unlock(&hash_lock);
+		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
 	return -EAGAIN;





* [patch V3 04/18] posix-timers: Cleanup includes
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 03/18] posix-timers: Add cond_resched() to posix_timer_add() search loop Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 05/18] posix-timers: Remove a few paranoid warnings Thomas Gleixner
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Remove pointless includes and sort the remaining ones alphabetically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>

---
 kernel/time/posix-timers.c |   26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -9,28 +9,22 @@
  *
  * These are all the functions necessary to implement POSIX clocks & timers
  */
-#include <linux/mm.h>
-#include <linux/interrupt.h>
-#include <linux/slab.h>
-#include <linux/time.h>
-#include <linux/mutex.h>
-#include <linux/sched/task.h>
-
-#include <linux/uaccess.h>
-#include <linux/list.h>
-#include <linux/init.h>
+#include <linux/compat.h>
 #include <linux/compiler.h>
 #include <linux/hash.h>
+#include <linux/hashtable.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/list.h>
+#include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
 #include <linux/syscalls.h>
-#include <linux/wait.h>
-#include <linux/workqueue.h>
-#include <linux/export.h>
-#include <linux/hashtable.h>
-#include <linux/compat.h>
-#include <linux/nospec.h>
+#include <linux/time.h>
 #include <linux/time_namespace.h>
+#include <linux/uaccess.h>
 
 #include "timekeeping.h"
 #include "posix-timers.h"





* [patch V3 05/18] posix-timers: Remove a few paranoid warnings
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 04/18] posix-timers: Cleanup includes Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 06/18] posix-timers: Remove SLAB_PANIC from kmem cache Thomas Gleixner
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Warnings about a non-initialized timer or non-existing callbacks are only
useful when implementing new POSIX clocks, and in that case a NULL pointer
dereference is to be expected anyway. :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

---
V2: New patch
---
 kernel/time/posix-timers.c |   37 ++++++++-----------------------------
 1 file changed, 8 insertions(+), 29 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -682,7 +682,6 @@ void common_timer_get(struct k_itimer *t
 
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
-	const struct k_clock *kc;
 	struct k_itimer *timr;
 	unsigned long flags;
 	int ret = 0;
@@ -692,11 +691,7 @@ static int do_timer_gettime(timer_t time
 		return -EINVAL;
 
 	memset(setting, 0, sizeof(*setting));
-	kc = timr->kclock;
-	if (WARN_ON_ONCE(!kc || !kc->timer_get))
-		ret = -EINVAL;
-	else
-		kc->timer_get(timr, setting);
+	timr->kclock->timer_get(timr, setting);
 
 	unlock_timer(timr, flags);
 	return ret;
@@ -824,7 +819,6 @@ static void common_timer_wait_running(st
 static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 					   unsigned long *flags)
 {
-	const struct k_clock *kc = READ_ONCE(timer->kclock);
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
@@ -835,8 +829,7 @@ static struct k_itimer *timer_wait_runni
 	 * kc->timer_wait_running() might drop RCU lock. So @timer
 	 * cannot be touched anymore after the function returns!
 	 */
-	if (!WARN_ON_ONCE(!kc->timer_wait_running))
-		kc->timer_wait_running(timer);
+	timer->kclock->timer_wait_running(timer);
 
 	rcu_read_unlock();
 	/* Relock the timer. It might be not longer hashed. */
@@ -899,7 +892,6 @@ static int do_timer_settime(timer_t time
 			    struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	const struct k_clock *kc;
 	struct k_itimer *timr;
 	unsigned long flags;
 	int error;
@@ -922,11 +914,7 @@ static int do_timer_settime(timer_t time
 	/* Prevent signal delivery and rearming. */
 	timr->it_signal_seq++;
 
-	kc = timr->kclock;
-	if (WARN_ON_ONCE(!kc || !kc->timer_set))
-		error = -EINVAL;
-	else
-		error = kc->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+	error = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
 
 	if (error == TIMER_RETRY) {
 		// We already got the old time...
@@ -1008,18 +996,6 @@ static inline void posix_timer_cleanup_i
 	}
 }
 
-static inline int timer_delete_hook(struct k_itimer *timer)
-{
-	const struct k_clock *kc = timer->kclock;
-
-	/* Prevent signal delivery and rearming. */
-	timer->it_signal_seq++;
-
-	if (WARN_ON_ONCE(!kc || !kc->timer_del))
-		return -EINVAL;
-	return kc->timer_del(timer);
-}
-
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
@@ -1032,7 +1008,10 @@ SYSCALL_DEFINE1(timer_delete, timer_t, t
 	if (!timer)
 		return -EINVAL;
 
-	if (unlikely(timer_delete_hook(timer) == TIMER_RETRY)) {
+	/* Prevent signal delivery and rearming. */
+	timer->it_signal_seq++;
+
+	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
 		/* Unlocks and relocks the timer if it still exists */
 		timer = timer_wait_running(timer, &flags);
 		goto retry_delete;
@@ -1078,7 +1057,7 @@ static void itimer_delete(struct k_itime
 	 * mechanism. Worse, that timer mechanism might run the expiry
 	 * function concurrently.
 	 */
-	if (timer_delete_hook(timer) == TIMER_RETRY) {
+	if (timer->kclock->timer_del(timer) == TIMER_RETRY) {
 		/*
 		 * Timer is expired concurrently, prevent livelocks
 		 * and pointless spinning on RT.



* [patch V3 06/18] posix-timers: Remove SLAB_PANIC from kmem cache
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 05/18] posix-timers: Remove a few paranoid warnings Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 07/18] posix-timers: Use guards in a few places Thomas Gleixner
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

There is no need to panic when the posix-timer kmem_cache can't be
created. timer_create() will fail with -ENOMEM and that's it.
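
The resulting failure mode can be sketched in user space as follows. This
is an illustration under assumptions: struct demo_cache, demo_timer_cache
and demo_alloc() are hypothetical names, and calloc() stands in for
kmem_cache_zalloc():

```c
#include <stdlib.h>

/* Hypothetical stand-in for a kmem_cache */
struct demo_cache {
	size_t obj_size;
};

/* Stays NULL when cache creation failed at init time */
static struct demo_cache *demo_timer_cache;

/* Return a zeroed object, or NULL when the cache does not exist */
static void *demo_alloc(void)
{
	if (!demo_timer_cache)
		return NULL;
	return calloc(1, demo_timer_cache->obj_size);
}
```

A caller treats the NULL return like any other allocation failure, which
is how timer_create() ends up returning -ENOMEM instead of the kernel
panicking at boot.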

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
V2: New patch
---
 kernel/time/posix-timers.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -243,9 +243,8 @@ static int posix_get_hrtimer_res(clockid
 
 static __init int init_posix_timers(void)
 {
-	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof(struct k_itimer), 0,
-					SLAB_PANIC | SLAB_ACCOUNT, NULL);
+	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer), 0,
+					       SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
@@ -371,8 +370,12 @@ static struct pid *good_sigevent(sigeven
 
 static struct k_itimer *alloc_posix_timer(void)
 {
-	struct k_itimer *tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
+	struct k_itimer *tmr;
 
+	if (unlikely(!posix_timers_cache))
+		return NULL;
+
+	tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
 	if (!tmr)
 		return tmr;
 



* [patch V3 07/18] posix-timers: Use guards in a few places
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 06/18] posix-timers: Remove SLAB_PANIC from kmem cache Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 08/18] posix-timers: Simplify lock/unlock_timer() Thomas Gleixner
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Switch locking and RCU to guards where applicable.
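
For readers not familiar with guards: the kernel's guard() and
scoped_guard() are built on the compiler's cleanup attribute. The
following user space sketch shows the idea with a pthread mutex, assuming
GCC or Clang. DEMO_GUARD(), demo_unlock() and demo_increment() are
hypothetical names and not the kernel implementation:

```c
#include <pthread.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_counter;

/* Invoked automatically when the guard variable goes out of scope */
static void demo_unlock(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Take the lock now, drop it at the end of the enclosing scope */
#define DEMO_GUARD(m)							\
	pthread_mutex_t *demo_guard					\
		__attribute__((cleanup(demo_unlock), unused)) =		\
		(pthread_mutex_lock(m), (m))

/* Increment the counter up to @limit; every return path unlocks */
static int demo_increment(int limit)
{
	DEMO_GUARD(&demo_lock);

	if (demo_counter >= limit)
		return -1;
	return ++demo_counter;
}
```

The early return needs no explicit unlock, which is what allows the
conversions in this patch to drop the unlock calls on the error and
success paths.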

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
V2: New patch
---
 kernel/time/posix-timers.c |   68 +++++++++++++++++++--------------------------
 1 file changed, 30 insertions(+), 38 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -397,9 +397,8 @@ void posixtimer_free_timer(struct k_itim
 
 static void posix_timer_unhash_and_free(struct k_itimer *tmr)
 {
-	spin_lock(&hash_lock);
-	hlist_del_rcu(&tmr->t_hash);
-	spin_unlock(&hash_lock);
+	scoped_guard (spinlock, &hash_lock)
+		hlist_del_rcu(&tmr->t_hash);
 	posixtimer_putref(tmr);
 }
 
@@ -443,9 +442,8 @@ static int do_timer_create(clockid_t whi
 	new_timer->it_overrun = -1LL;
 
 	if (event) {
-		rcu_read_lock();
-		new_timer->it_pid = get_pid(good_sigevent(event));
-		rcu_read_unlock();
+		scoped_guard (rcu)
+			new_timer->it_pid = get_pid(good_sigevent(event));
 		if (!new_timer->it_pid) {
 			error = -EINVAL;
 			goto out;
@@ -579,7 +577,7 @@ static struct k_itimer *__lock_timer(tim
 	 * can't change, but timr::it_signal becomes NULL during
 	 * destruction.
 	 */
-	rcu_read_lock();
+	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
 	if (timr) {
 		spin_lock_irqsave(&timr->it_lock, *flags);
@@ -587,14 +585,10 @@ static struct k_itimer *__lock_timer(tim
 		 * Validate under timr::it_lock that timr::it_signal is
 		 * still valid. Pairs with #1 above.
 		 */
-		if (timr->it_signal == current->signal) {
-			rcu_read_unlock();
+		if (timr->it_signal == current->signal)
 			return timr;
-		}
 		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	rcu_read_unlock();
-
 	return NULL;
 }
 
@@ -825,16 +819,15 @@ static struct k_itimer *timer_wait_runni
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
-	rcu_read_lock();
-	unlock_timer(timer, *flags);
-
-	/*
-	 * kc->timer_wait_running() might drop RCU lock. So @timer
-	 * cannot be touched anymore after the function returns!
-	 */
-	timer->kclock->timer_wait_running(timer);
+	scoped_guard (rcu) {
+		unlock_timer(timer, *flags);
+		/*
+		 * kc->timer_wait_running() might drop RCU lock. So @timer
+		 * cannot be touched anymore after the function returns!
+		 */
+		timer->kclock->timer_wait_running(timer);
+	}
 
-	rcu_read_unlock();
 	/* Relock the timer. It might be not longer hashed. */
 	return lock_timer(timer_id, flags);
 }
@@ -1020,20 +1013,20 @@ SYSCALL_DEFINE1(timer_delete, timer_t, t
 		goto retry_delete;
 	}
 
-	spin_lock(&current->sighand->siglock);
-	hlist_del(&timer->list);
-	posix_timer_cleanup_ignored(timer);
-	/*
-	 * A concurrent lookup could check timer::it_signal lockless. It
-	 * will reevaluate with timer::it_lock held and observe the NULL.
-	 *
-	 * It must be written with siglock held so that the signal code
-	 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
-	 * which prevents it from moving a pending signal of a deleted
-	 * timer to the ignore list.
-	 */
-	WRITE_ONCE(timer->it_signal, NULL);
-	spin_unlock(&current->sighand->siglock);
+	scoped_guard (spinlock, &current->sighand->siglock) {
+		hlist_del(&timer->list);
+		posix_timer_cleanup_ignored(timer);
+		/*
+		 * A concurrent lookup could check timer::it_signal lockless. It
+		 * will reevaluate with timer::it_lock held and observe the NULL.
+		 *
+		 * It must be written with siglock held so that the signal code
+		 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
+		 * which prevents it from moving a pending signal of a deleted
+		 * timer to the ignore list.
+		 */
+		WRITE_ONCE(timer->it_signal, NULL);
+	}
 
 	unlock_timer(timer, flags);
 	posix_timer_unhash_and_free(timer);
@@ -1106,9 +1099,8 @@ void exit_itimers(struct task_struct *ts
 		return;
 
 	/* Protect against concurrent read via /proc/$PID/timers */
-	spin_lock_irq(&tsk->sighand->siglock);
-	hlist_move_list(&tsk->signal->posix_timers, &timers);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	scoped_guard (spinlock_irq, &tsk->sighand->siglock)
+		hlist_move_list(&tsk->signal->posix_timers, &timers);
 
 	/* The timers are not longer accessible via tsk::signal */
 	while (!hlist_empty(&timers)) {



* [patch V3 08/18] posix-timers: Simplify lock/unlock_timer()
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (6 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 07/18] posix-timers: Use guards in a few places Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 09/18] posix-timers: Rework timer removal Thomas Gleixner
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Since the integration of sigqueue into the timer struct, lock_timer() is
only used in task context. So taking the lock with irqsave() is no longer
required.

Convert it to use spin_[un]lock_irq().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
V2: New patch
---
 kernel/time/posix-timers.c |   70 ++++++++++++++++++---------------------------
 1 file changed, 29 insertions(+), 41 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -53,14 +53,19 @@ static const struct k_clock clock_realti
 #error "SIGEV_THREAD_ID must not share bit with other SIGEV values!"
 #endif
 
-static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags);
+static struct k_itimer *__lock_timer(timer_t timer_id);
 
-#define lock_timer(tid, flags)						   \
-({	struct k_itimer *__timr;					   \
-	__cond_lock(&__timr->it_lock, __timr = __lock_timer(tid, flags));  \
-	__timr;								   \
+#define lock_timer(tid)							\
+({	struct k_itimer *__timr;					\
+	__cond_lock(&__timr->it_lock, __timr = __lock_timer(tid));	\
+	__timr;								\
 })
 
+static inline void unlock_timer(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+}
+
 static int hash(struct signal_struct *sig, unsigned int nr)
 {
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
@@ -144,11 +149,6 @@ static int posix_timer_add(struct k_itim
 	return -EAGAIN;
 }
 
-static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
-{
-	spin_unlock_irqrestore(&timr->it_lock, flags);
-}
-
 static int posix_get_realtime_timespec(clockid_t which_clock, struct timespec64 *tp)
 {
 	ktime_get_real_ts64(tp);
@@ -538,7 +538,7 @@ COMPAT_SYSCALL_DEFINE3(timer_create, clo
 }
 #endif
 
-static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
+static struct k_itimer *__lock_timer(timer_t timer_id)
 {
 	struct k_itimer *timr;
 
@@ -580,14 +580,14 @@ static struct k_itimer *__lock_timer(tim
 	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
 	if (timr) {
-		spin_lock_irqsave(&timr->it_lock, *flags);
+		spin_lock_irq(&timr->it_lock);
 		/*
 		 * Validate under timr::it_lock that timr::it_signal is
 		 * still valid. Pairs with #1 above.
 		 */
 		if (timr->it_signal == current->signal)
 			return timr;
-		spin_unlock_irqrestore(&timr->it_lock, *flags);
+		spin_unlock_irq(&timr->it_lock);
 	}
 	return NULL;
 }
@@ -680,17 +680,16 @@ void common_timer_get(struct k_itimer *t
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int ret = 0;
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 	if (!timr)
 		return -EINVAL;
 
 	memset(setting, 0, sizeof(*setting));
 	timr->kclock->timer_get(timr, setting);
 
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 	return ret;
 }
 
@@ -746,15 +745,14 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int overrun;
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 	if (!timr)
 		return -EINVAL;
 
 	overrun = timer_overrun_to_int(timr);
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 
 	return overrun;
 }
@@ -813,14 +811,13 @@ static void common_timer_wait_running(st
  * when the task which tries to delete or disarm the timer has preempted
  * the task which runs the expiry in task work context.
  */
-static struct k_itimer *timer_wait_running(struct k_itimer *timer,
-					   unsigned long *flags)
+static struct k_itimer *timer_wait_running(struct k_itimer *timer)
 {
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
 	scoped_guard (rcu) {
-		unlock_timer(timer, *flags);
+		unlock_timer(timer);
 		/*
 		 * kc->timer_wait_running() might drop RCU lock. So @timer
 		 * cannot be touched anymore after the function returns!
@@ -829,7 +826,7 @@ static struct k_itimer *timer_wait_runni
 	}
 
 	/* Relock the timer. It might be not longer hashed. */
-	return lock_timer(timer_id, flags);
+	return lock_timer(timer_id);
 }
 
 /*
@@ -889,7 +886,6 @@ static int do_timer_settime(timer_t time
 			    struct itimerspec64 *old_spec64)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int error;
 
 	if (!timespec64_valid(&new_spec64->it_interval) ||
@@ -899,7 +895,7 @@ static int do_timer_settime(timer_t time
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 retry:
 	if (!timr)
 		return -EINVAL;
@@ -916,10 +912,10 @@ static int do_timer_settime(timer_t time
 		// We already got the old time...
 		old_spec64 = NULL;
 		/* Unlocks and relocks the timer if it still exists */
-		timr = timer_wait_running(timr, &flags);
+		timr = timer_wait_running(timr);
 		goto retry;
 	}
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 
 	return error;
 }
@@ -995,10 +991,7 @@ static inline void posix_timer_cleanup_i
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	struct k_itimer *timer;
-	unsigned long flags;
-
-	timer = lock_timer(timer_id, &flags);
+	struct k_itimer *timer = lock_timer(timer_id);
 
 retry_delete:
 	if (!timer)
@@ -1009,7 +1002,7 @@ SYSCALL_DEFINE1(timer_delete, timer_t, t
 
 	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
 		/* Unlocks and relocks the timer if it still exists */
-		timer = timer_wait_running(timer, &flags);
+		timer = timer_wait_running(timer);
 		goto retry_delete;
 	}
 
@@ -1028,7 +1021,7 @@ SYSCALL_DEFINE1(timer_delete, timer_t, t
 		WRITE_ONCE(timer->it_signal, NULL);
 	}
 
-	unlock_timer(timer, flags);
+	unlock_timer(timer);
 	posix_timer_unhash_and_free(timer);
 	return 0;
 }
@@ -1039,12 +1032,7 @@ SYSCALL_DEFINE1(timer_delete, timer_t, t
  */
 static void itimer_delete(struct k_itimer *timer)
 {
-	unsigned long flags;
-
-	/*
-	 * irqsave is required to make timer_wait_running() work.
-	 */
-	spin_lock_irqsave(&timer->it_lock, flags);
+	spin_lock_irq(&timer->it_lock);
 
 retry_delete:
 	/*
@@ -1065,7 +1053,7 @@ static void itimer_delete(struct k_itime
 		 * do_exit() only for the last thread of the thread group.
 		 * So no other task can access and delete that timer.
 		 */
-		if (WARN_ON_ONCE(timer_wait_running(timer, &flags) != timer))
+		if (WARN_ON_ONCE(timer_wait_running(timer) != timer))
 			return;
 
 		goto retry_delete;
@@ -1082,7 +1070,7 @@ static void itimer_delete(struct k_itime
 	 */
 	WRITE_ONCE(timer->it_signal, NULL);
 
-	spin_unlock_irqrestore(&timer->it_lock, flags);
+	spin_unlock_irq(&timer->it_lock);
 	posix_timer_unhash_and_free(timer);
 }
 



* [patch V3 09/18] posix-timers: Rework timer removal
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (7 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 08/18] posix-timers: Simplify lock/unlock_timer() Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-09 23:17   ` Frederic Weisbecker
  2025-03-10  8:13   ` [patch V3a " Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 10/18] posix-timers: Make lock_timer() use guard() Thomas Gleixner
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

sys_timer_delete() and the do_exit() cleanup function itimer_delete() are
doing the same thing, but have needlessly different implementations instead
of sharing the code.

The other oddity of timer deletion is the fact that the timer is not
invalidated before the actual deletion happens, which allows concurrent
lookups to succeed.

That's wrong because a timer which is in the process of being deleted
should not be visible and any actions like signal queueing, delivery and
rearming should not happen once the task, which invoked timer_delete(), has
the timer locked.

Rework the code so that:

   1) The signal queueing and delivery code ignore timers which are marked
      invalid

   2) The deletion implementation between sys_timer_delete() and
      itimer_delete() is shared

   3) The timer is invalidated and removed from the linked lists before
      the deletion callback of the relevant clock is invoked.

      That requires reworking timer_wait_running() as it does a lookup of
      the timer when relocking it at the end. In case of deletion this
      lookup would fail due to the preceding invalidation and the wait loop
      would terminate prematurely.

      But due to the preceding invalidation the timer cannot be accessed by
      other tasks anymore, so there is no way that the timer has been freed
      after the timer lock has been dropped.

      Move the re-validation out of timer_wait_running() and handle it at
      the only other usage site, timer_settime().
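
The invalidation in step 3 works by tagging bit 0 of the it_signal
pointer. A minimal user-space sketch of that idea follows. It assumes
the owner structure is at least 2-byte aligned, so bit 0 of a genuine
pointer is always clear. All demo_* names are hypothetical stand-ins
and not the kernel's types or helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for signal_struct and k_itimer. */
struct demo_signal { int dummy; };
struct demo_timer { struct demo_signal *it_signal; };

/* A timer is valid as long as bit 0 of the owner pointer is clear. */
static inline bool demo_timer_valid(const struct demo_timer *t)
{
	return !((uintptr_t)t->it_signal & 0x1UL);
}

/* Mark the timer invalid by setting bit 0 of the owner pointer. */
static inline void demo_timer_invalidate(struct demo_timer *t)
{
	/* This sketch expects a timer to be invalidated only once. */
	assert(demo_timer_valid(t));
	t->it_signal = (struct demo_signal *)((uintptr_t)t->it_signal | 0x1UL);
}

/* Ownership check as a lookup would do it: a tagged pointer never matches. */
static inline bool demo_timer_owned_by(const struct demo_timer *t,
				       const struct demo_signal *sig)
{
	return t->it_signal == sig;
}
```

Because the tagged value differs from every real owner pointer, a
plain pointer comparison in the lookup path fails for an invalidated
timer without needing a separate flag.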

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V2: Simplify timer_wait_running() locking - PeterZ
---
 include/linux/posix-timers.h |    7 +
 kernel/signal.c              |    2 
 kernel/time/posix-timers.c   |  194 ++++++++++++++++++-------------------------
 3 files changed, 90 insertions(+), 113 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -240,6 +240,13 @@ static inline void posixtimer_sigqueue_p
 
 	posixtimer_putref(tmr);
 }
+
+static inline bool posixtimer_valid(const struct k_itimer *timer)
+{
+	unsigned long val = (unsigned long)timer->it_signal;
+
+	return !(val & 0x1UL);
+}
 #else  /* CONFIG_POSIX_TIMERS */
 static inline void posixtimer_sigqueue_getref(struct sigqueue *q) { }
 static inline void posixtimer_sigqueue_putref(struct sigqueue *q) { }
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2092,7 +2092,7 @@ static inline void posixtimer_sig_ignore
 	 * from a non-periodic timer, then just drop the reference
 	 * count. Otherwise queue it on the ignored list.
 	 */
-	if (tmr->it_signal && tmr->it_sig_periodic)
+	if (posixtimer_valid(tmr) && tmr->it_sig_periodic)
 		hlist_add_head(&tmr->ignored_list, &tsk->signal->ignored_posix_timers);
 	else
 		posixtimer_putref(tmr);
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -279,7 +279,7 @@ static bool __posixtimer_deliver_signal(
 	 * since the signal was queued. In either case, don't rearm and
 	 * drop the signal.
 	 */
-	if (timr->it_signal_seq != timr->it_sigqueue_seq || WARN_ON_ONCE(!timr->it_signal))
+	if (timr->it_signal_seq != timr->it_sigqueue_seq || !posixtimer_valid(timr))
 		return false;
 
 	if (!timr->it_interval || WARN_ON_ONCE(timr->it_status != POSIX_TIMER_REQUEUE_PENDING))
@@ -324,6 +324,9 @@ void posix_timer_queue_signal(struct k_i
 {
 	lockdep_assert_held(&timr->it_lock);
 
+	if (!posixtimer_valid(timr))
+		return;
+
 	timr->it_status = timr->it_interval ? POSIX_TIMER_REQUEUE_PENDING : POSIX_TIMER_DISARMED;
 	posixtimer_send_sigqueue(timr);
 }
@@ -553,11 +556,11 @@ static struct k_itimer *__lock_timer(tim
 	 * The hash lookup and the timers are RCU protected.
 	 *
 	 * Timers are added to the hash in invalid state where
-	 * timr::it_signal == NULL. timer::it_signal is only set after the
-	 * rest of the initialization succeeded.
+	 * timr::it_signal is marked invalid. timer::it_signal is only set
+	 * after the rest of the initialization succeeded.
 	 *
 	 * Timer destruction happens in steps:
-	 *  1) Set timr::it_signal to NULL with timr::it_lock held
+	 *  1) Set timr::it_signal marked invalid with timr::it_lock held
 	 *  2) Release timr::it_lock
 	 *  3) Remove from the hash under hash_lock
 	 *  4) Put the reference count.
@@ -574,8 +577,8 @@ static struct k_itimer *__lock_timer(tim
 	 *
 	 * The lookup validates locklessly that timr::it_signal ==
 	 * current::it_signal and timr::it_id == @timer_id. timr::it_id
-	 * can't change, but timr::it_signal becomes NULL during
-	 * destruction.
+	 * can't change, but timr::it_signal can become invalid during
+	 * destruction, which makes the locked check fail.
 	 */
 	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
@@ -811,22 +814,13 @@ static void common_timer_wait_running(st
  * when the task which tries to delete or disarm the timer has preempted
  * the task which runs the expiry in task work context.
  */
-static struct k_itimer *timer_wait_running(struct k_itimer *timer)
+static void timer_wait_running(struct k_itimer *timer)
 {
-	timer_t timer_id = READ_ONCE(timer->it_id);
-
-	/* Prevent kfree(timer) after dropping the lock */
-	scoped_guard (rcu) {
-		unlock_timer(timer);
-		/*
-		 * kc->timer_wait_running() might drop RCU lock. So @timer
-		 * cannot be touched anymore after the function returns!
-		 */
-		timer->kclock->timer_wait_running(timer);
-	}
-
-	/* Relock the timer. It might be not longer hashed. */
-	return lock_timer(timer_id);
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
+	timer->kclock->timer_wait_running(timer);
 }
 
 /*
@@ -885,8 +879,7 @@ static int do_timer_settime(timer_t time
 			    struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	struct k_itimer *timr;
-	int error;
+	int ret;
 
 	if (!timespec64_valid(&new_spec64->it_interval) ||
 	    !timespec64_valid(&new_spec64->it_value))
@@ -895,29 +888,36 @@ static int do_timer_settime(timer_t time
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	timr = lock_timer(timer_id);
-retry:
-	if (!timr)
-		return -EINVAL;
+	for (;;) {
+		struct k_itimer *timr = lock_timer(timer_id);
 
-	if (old_spec64)
-		old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
+		if (!timr)
+			return -EINVAL;
 
-	/* Prevent signal delivery and rearming. */
-	timr->it_signal_seq++;
+		if (old_spec64)
+			old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
 
-	error = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		/* Prevent signal delivery and rearming. */
+		timr->it_signal_seq++;
+
+		ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		if (ret != TIMER_RETRY) {
+			unlock_timer(timr);
+			break;
+		}
 
-	if (error == TIMER_RETRY) {
-		// We already got the old time...
+		/* Read the old time only once */
 		old_spec64 = NULL;
-		/* Unlocks and relocks the timer if it still exists */
-		timr = timer_wait_running(timr);
-		goto retry;
+		/* Protect the timer from being freed after the lock is dropped */
+		guard(rcu)();
+		unlock_timer(timr);
+		/*
+		 * timer_wait_running() might drop RCU read side protection
+		 * so the timer has to be looked up again!
+		 */
+		timer_wait_running(timr);
 	}
-	unlock_timer(timr);
-
-	return error;
+	return ret;
 }
 
 /* Set a POSIX.1b interval timer */
@@ -988,90 +988,56 @@ static inline void posix_timer_cleanup_i
 	}
 }
 
-/* Delete a POSIX.1b interval timer. */
-SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
+static void posix_timer_delete(struct k_itimer *timer)
 {
-	struct k_itimer *timer = lock_timer(timer_id);
-
-retry_delete:
-	if (!timer)
-		return -EINVAL;
-
-	/* Prevent signal delivery and rearming. */
+	/*
+	 * Invalidate the timer, remove it from the linked list and remove
+	 * it from the ignored list if pending.
+	 *
+	 * The invalidation must be written with siglock held so that the
+	 * signal code observes timer->it_valid == false in do_sigaction(),
+	 * which prevents it from moving a pending signal of a deleted
+	 * timer to the ignore list.
+	 *
+	 * The invalidation also prevents signal queueing, signal delivery
+	 * and therefore rearming from the signal delivery path.
+	 *
+	 * A concurrent lookup can still find the timer in the hash, but it
+	 * will check timer::it_signal with timer::it_lock held and observe
+	 * bit 0 set, which invalidates it. That also prevents the timer ID
+	 * from being handed out before this timer is completely gone.
+	 */
 	timer->it_signal_seq++;
 
-	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
-		/* Unlocks and relocks the timer if it still exists */
-		timer = timer_wait_running(timer);
-		goto retry_delete;
-	}
-
 	scoped_guard (spinlock, &current->sighand->siglock) {
+		unsigned long sig = (unsigned long)timer->it_signal | 1UL;
+
+		WRITE_ONCE(timer->it_signal, (struct signal_struct *)sig);
 		hlist_del(&timer->list);
 		posix_timer_cleanup_ignored(timer);
-		/*
-		 * A concurrent lookup could check timer::it_signal lockless. It
-		 * will reevaluate with timer::it_lock held and observe the NULL.
-		 *
-		 * It must be written with siglock held so that the signal code
-		 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
-		 * which prevents it from moving a pending signal of a deleted
-		 * timer to the ignore list.
-		 */
-		WRITE_ONCE(timer->it_signal, NULL);
 	}
 
-	unlock_timer(timer);
-	posix_timer_unhash_and_free(timer);
-	return 0;
+	while (timer->kclock->timer_del(timer) == TIMER_RETRY) {
+		guard(rcu)();
+		spin_unlock_irq(&timer->it_lock);
+		timer_wait_running(timer);
+		spin_lock_irq(&timer->it_lock);
+	}
 }
 
-/*
- * Delete a timer if it is armed, remove it from the hash and schedule it
- * for RCU freeing.
- */
-static void itimer_delete(struct k_itimer *timer)
+/* Delete a POSIX.1b interval timer. */
+SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	spin_lock_irq(&timer->it_lock);
-
-retry_delete:
-	/*
-	 * Even if the timer is not longer accessible from other tasks
-	 * it still might be armed and queued in the underlying timer
-	 * mechanism. Worse, that timer mechanism might run the expiry
-	 * function concurrently.
-	 */
-	if (timer->kclock->timer_del(timer) == TIMER_RETRY) {
-		/*
-		 * Timer is expired concurrently, prevent livelocks
-		 * and pointless spinning on RT.
-		 *
-		 * timer_wait_running() drops timer::it_lock, which opens
-		 * the possibility for another task to delete the timer.
-		 *
-		 * That's not possible here because this is invoked from
-		 * do_exit() only for the last thread of the thread group.
-		 * So no other task can access and delete that timer.
-		 */
-		if (WARN_ON_ONCE(timer_wait_running(timer) != timer))
-			return;
-
-		goto retry_delete;
-	}
-	hlist_del(&timer->list);
-
-	posix_timer_cleanup_ignored(timer);
+	struct k_itimer *timer = lock_timer(timer_id);
 
-	/*
-	 * Setting timer::it_signal to NULL is technically not required
-	 * here as nothing can access the timer anymore legitimately via
-	 * the hash table. Set it to NULL nevertheless so that all deletion
-	 * paths are consistent.
-	 */
-	WRITE_ONCE(timer->it_signal, NULL);
+	if (!timer)
+		return -EINVAL;
 
-	spin_unlock_irq(&timer->it_lock);
+	posix_timer_delete(timer);
+	unlock_timer(timer);
+	/* Remove it from the hash, which frees up the timer ID */
 	posix_timer_unhash_and_free(timer);
+	return 0;
 }
 
 /*
@@ -1082,6 +1048,8 @@ static void itimer_delete(struct k_itime
 void exit_itimers(struct task_struct *tsk)
 {
 	struct hlist_head timers;
+	struct hlist_node *next;
+	struct k_itimer *timer;
 
 	if (hlist_empty(&tsk->signal->posix_timers))
 		return;
@@ -1091,8 +1059,10 @@ void exit_itimers(struct task_struct *ts
 		hlist_move_list(&tsk->signal->posix_timers, &timers);
 
 	/* The timers are not longer accessible via tsk::signal */
-	while (!hlist_empty(&timers)) {
-		itimer_delete(hlist_entry(timers.first, struct k_itimer, list));
+	hlist_for_each_entry_safe(timer, next, &timers, list) {
+		scoped_guard (spinlock_irq, &timer->it_lock)
+			posix_timer_delete(timer);
+		posix_timer_unhash_and_free(timer);
 		cond_resched();
 	}
 



* [patch V3 10/18] posix-timers: Make lock_timer() use guard()
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (8 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 09/18] posix-timers: Rework timer removal Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-10 11:57   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
  2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t Thomas Gleixner
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov


From: Peter Zijlstra <peterz@infradead.org>

The lookup and locking of posix timers requires the same repeating pattern
at all usage sites:

   tmr = lock_timer(timer_id);
   if (!tmr)
   	return -EINVAL;
   ....
   unlock_timer(tmr);

Solve this with a guard implementation, which works out of the box in most
places, except for those which need to unlock the timer inside the guard
scope.

Though the only places where this matters are timer_delete() and
timer_settime(). In both cases the timer pointer needs to be preserved
across the end of the scope, which is solved by storing the pointer in a
variable outside of the scope.

timer_settime() also has to protect the timer with RCU before unlocking,
which obviously can't use guard(rcu) before leaving the guard scope as that
guard is cleaned up before the unlock. Solve this by providing the RCU
protection open coded.
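
The guard mechanism builds on the compiler's cleanup attribute. A
minimal user-space sketch of a conditional lock guard, assuming GCC or
Clang, could look like this. The demo_* names, the table and the
"locked" flag standing in for it_lock are hypothetical and only
illustrate the scoping behaviour:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical timer; the "locked" flag stands in for it_lock. */
struct demo_timer { int id; int locked; int overrun; };

static struct demo_timer demo_table[2] = {
	{ .id = 1, .overrun = 3 },
	{ .id = 2, .overrun = 7 },
};

/* Look up and lock a timer; returns NULL if the ID is unknown. */
static inline struct demo_timer *demo_lock_timer(int id)
{
	for (size_t i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
		if (demo_table[i].id == id) {
			demo_table[i].locked = 1;
			return &demo_table[i];
		}
	}
	return NULL;
}

/* Cleanup handler: gets a pointer to the guarded variable, tolerates NULL. */
static inline void demo_unlock_timer(struct demo_timer **tp)
{
	if (*tp)
		(*tp)->locked = 0;
}

static inline int demo_getoverrun(int id)
{
	/* The unlock runs automatically on every exit from this scope. */
	struct demo_timer *t __attribute__((cleanup(demo_unlock_timer))) =
		demo_lock_timer(id);

	if (!t)
		return -EINVAL;

	/* The timer stays locked until the scope is left. */
	assert(t->locked);
	return t->overrun;
}
```

The NULL check in the cleanup handler mirrors the conditional unlock:
the handler also runs when the lookup failed and the function bails
out early.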

[ tglx: Made it work and added change log ]

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net
---
V2a: Make unlock conditional - 0day
V2: New patch
---
 include/linux/cleanup.h    |   22 ++++++----
 kernel/time/posix-timers.c |   94 +++++++++++++++++----------------------------
 2 files changed, 51 insertions(+), 65 deletions(-)

--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -291,11 +291,21 @@ static inline class_##_name##_t class_##
 #define __DEFINE_CLASS_IS_CONDITIONAL(_name, _is_cond)	\
 static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
 
-#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
+#define __DEFINE_GUARD_LOCK_PTR(_name, _exp) \
+	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
+	{ return (void *)(__force unsigned long)*(_exp); }
+
+#define DEFINE_CLASS_IS_GUARD(_name) \
 	__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
+	__DEFINE_GUARD_LOCK_PTR(_name, _T)
+
+#define DEFINE_CLASS_IS_COND_GUARD(_name) \
+	__DEFINE_CLASS_IS_CONDITIONAL(_name, true); \
+	__DEFINE_GUARD_LOCK_PTR(_name, _T)
+
+#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
 	DEFINE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
-	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return (void *)(__force unsigned long)*_T; }
+	DEFINE_CLASS_IS_GUARD(_name)
 
 #define DEFINE_GUARD_COND(_name, _ext, _condlock) \
 	__DEFINE_CLASS_IS_CONDITIONAL(_name##_ext, true); \
@@ -375,11 +385,7 @@ static inline void class_##_name##_destr
 	if (_T->lock) { _unlock; }					\
 }									\
 									\
-static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
-{									\
-	return (void *)(__force unsigned long)_T->lock;			\
-}
-
+__DEFINE_GUARD_LOCK_PTR(_name, &_T->lock)
 
 #define __DEFINE_LOCK_GUARD_1(_name, _type, _lock)			\
 static inline class_##_name##_t class_##_name##_constructor(_type *l)	\
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -63,9 +63,18 @@ static struct k_itimer *__lock_timer(tim
 
 static inline void unlock_timer(struct k_itimer *timr)
 {
-	spin_unlock_irq(&timr->it_lock);
+	if (likely((timr)))
+		spin_unlock_irq(&timr->it_lock);
 }
 
+#define scoped_timer_get_or_fail(_id)					\
+	scoped_cond_guard(lock_timer, return -EINVAL, _id)
+
+#define scoped_timer				(scope)
+
+DEFINE_CLASS(lock_timer, struct k_itimer *, unlock_timer(_T), __lock_timer(id), timer_t id);
+DEFINE_CLASS_IS_COND_GUARD(lock_timer);
+
 static int hash(struct signal_struct *sig, unsigned int nr)
 {
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
@@ -682,18 +691,10 @@ void common_timer_get(struct k_itimer *t
 
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
-	struct k_itimer *timr;
-	int ret = 0;
-
-	timr = lock_timer(timer_id);
-	if (!timr)
-		return -EINVAL;
-
 	memset(setting, 0, sizeof(*setting));
-	timr->kclock->timer_get(timr, setting);
-
-	unlock_timer(timr);
-	return ret;
+	scoped_timer_get_or_fail(timer_id)
+		scoped_timer->kclock->timer_get(scoped_timer, setting);
+	return 0;
 }
 
 /* Get the time remaining on a POSIX.1b interval timer. */
@@ -747,17 +748,8 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t
  */
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
-	struct k_itimer *timr;
-	int overrun;
-
-	timr = lock_timer(timer_id);
-	if (!timr)
-		return -EINVAL;
-
-	overrun = timer_overrun_to_int(timr);
-	unlock_timer(timr);
-
-	return overrun;
+	scoped_timer_get_or_fail(timer_id)
+		return timer_overrun_to_int(scoped_timer);
 }
 
 static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
@@ -875,12 +867,9 @@ int common_timer_set(struct k_itimer *ti
 	return 0;
 }
 
-static int do_timer_settime(timer_t timer_id, int tmr_flags,
-			    struct itimerspec64 *new_spec64,
+static int do_timer_settime(timer_t timer_id, int tmr_flags, struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	int ret;
-
 	if (!timespec64_valid(&new_spec64->it_interval) ||
 	    !timespec64_valid(&new_spec64->it_value))
 		return -EINVAL;
@@ -888,36 +877,28 @@ static int do_timer_settime(timer_t time
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	for (;;) {
-		struct k_itimer *timr = lock_timer(timer_id);
+	for (; ; old_spec64 = NULL) {
+		struct k_itimer *timr;
 
-		if (!timr)
-			return -EINVAL;
+		scoped_timer_get_or_fail(timer_id) {
+			timr = scoped_timer;
 
-		if (old_spec64)
-			old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
+			if (old_spec64)
+				old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
 
-		/* Prevent signal delivery and rearming. */
-		timr->it_signal_seq++;
-
-		ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
-		if (ret != TIMER_RETRY) {
-			unlock_timer(timr);
-			break;
-		}
+			/* Prevent signal delivery and rearming. */
+			timr->it_signal_seq++;
 
-		/* Read the old time only once */
-		old_spec64 = NULL;
-		/* Protect the timer from being freed after the lock is dropped */
-		guard(rcu)();
-		unlock_timer(timr);
-		/*
-		 * timer_wait_running() might drop RCU read side protection
-		 * so the timer has to be looked up again!
-		 */
+			int ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+			if (ret != TIMER_RETRY)
+				return ret;
+
+			/* Protect the timer from being freed when leaving the lock scope */
+			rcu_read_lock();
+		}
 		timer_wait_running(timr);
+		rcu_read_unlock();
 	}
-	return ret;
 }
 
 /* Set a POSIX.1b interval timer */
@@ -1028,13 +1009,12 @@ static void posix_timer_delete(struct k_
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	struct k_itimer *timer = lock_timer(timer_id);
-
-	if (!timer)
-		return -EINVAL;
+	struct k_itimer *timer;
 
-	posix_timer_delete(timer);
-	unlock_timer(timer);
+	scoped_timer_get_or_fail(timer_id) {
+		timer = scoped_timer;
+		posix_timer_delete(timer);
+	}
 	/* Remove it from the hash, which frees up the timer ID */
 	posix_timer_unhash_and_free(timer);
 	return 0;



* [patch V3 11/18] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (9 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 10/18] posix-timers: Make lock_timer() use guard() Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-10 22:57   ` Frederic Weisbecker
                     ` (2 more replies)
  2025-03-08 16:48 ` [patch V3 12/18] posix-timers: Improve hash table performance Thomas Gleixner
                   ` (6 subsequent siblings)
  17 siblings, 3 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

From: Eric Dumazet <edumazet@google.com>

The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.

Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.

To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires handling the ID counter
locklessly.

Prepare for this by making next_posix_timer_id an atomic_t, which can be
used locklessly with atomic_fetch_inc().
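
As a rough user-space illustration of the lockless ID hand-out, the
following sketch uses C11 atomics instead of the kernel's atomic_t.
demo_next_id and demo_alloc_id() are hypothetical names, and an
unsigned counter is assumed so that wraparound is well defined:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical per-process counter, standing in for next_posix_timer_id. */
static atomic_uint demo_next_id;

/*
 * Fetch the current counter value and advance it in one atomic step,
 * then clamp the result to the non-negative int range.
 */
static inline unsigned int demo_alloc_id(void)
{
	unsigned int id = atomic_fetch_add(&demo_next_id, 1) & INT_MAX;

	/* The mask guarantees that the ID fits into a positive int. */
	assert(id <= (unsigned int)INT_MAX);
	return id;
}
```

The candidate ID still has to be checked against the hash table
afterwards, as two processes or a counter wraparound can produce an ID
that is already in use.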

[ tglx: Adopted from Eric's series, massaged change log and simplified it ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com

---
V2: Use atomic_fetch_inc() - PeterZ
---
 include/linux/sched/signal.h |    2 +-
 kernel/time/posix-timers.c   |   14 +++++---------
 2 files changed, 6 insertions(+), 10 deletions(-)

--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,7 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
-	unsigned int		next_posix_timer_id;
+	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
 
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -119,21 +119,17 @@ static bool posix_timer_hashed(struct hl
 static int posix_timer_add(struct k_itimer *timer)
 {
 	struct signal_struct *sig = current->signal;
-	struct hlist_head *head;
-	unsigned int cnt, id;
 
 	/*
 	 * FIXME: Replace this by a per signal struct xarray once there is
 	 * a plan to handle the resulting CRIU regression gracefully.
 	 */
-	for (cnt = 0; cnt <= INT_MAX; cnt++) {
-		spin_lock(&hash_lock);
-		id = sig->next_posix_timer_id;
-
-		/* Write the next ID back. Clamp it to the positive space */
-		sig->next_posix_timer_id = (id + 1) & INT_MAX;
+	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
+		/* Get the next timer ID and clamp it to positive space */
+		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
+		struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
 
-		head = &posix_timers_hashtable[hash(sig, id)];
+		spin_lock(&hash_lock);
 		if (!posix_timer_hashed(head, sig, id)) {
 			/*
 			 * Set the timer ID and the signal pointer to make



* [patch V3 12/18] posix-timers: Improve hash table performance
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (10 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-11 13:44   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 13/18] posix-timers: Switch to jhash32() Thomas Gleixner
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Eric and Ben reported a significant performance bottleneck on the global
hash, which is used to store posix timers for lookup.

Eric tried to do a lockless validation of a new timer ID before trying to
insert the timer, but that does not solve the problem.

For the non-contended case this is a pointless exercise and for the
contended case this extra lookup just creates enough interleaving that all
tasks can make progress.

There are actually two real solutions to the problem:

  1) Provide a per process (signal struct) xarray storage

  2) Implement a smarter hash like the one in the futex code

#1 works perfectly fine for most cases, but the fact that CRIU enforced a
   linearly increasing timer ID to restore timers makes this problematic.

   It's easy enough to create a sparse timer ID space, which very quickly
   amounts to a large chunk of memory consumed for the xarray. 2048 timers
   with an ID offset of 512 consume more than one megabyte of memory for
   the xarray storage.

#2 The main advantage of the futex hash is that it uses per hash bucket
   locks instead of a global hash lock. Aside from that it is scaled
   according to the number of CPUs at boot time.
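
The per bucket locking described in #2 can be illustrated with a small
userspace sketch. This is not the kernel implementation: struct entry,
struct table, table_init() and table_insert() are hypothetical names, a
pthread mutex stands in for the spinlock and the mixing function is
purely illustrative.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a timer, not the kernel's struct k_itimer */
struct entry {
	struct entry	*next;
	const void	*owner;	/* plays the role of the signal_struct pointer */
	unsigned int	id;
};

struct bucket {
	pthread_mutex_t	lock;	/* one lock per bucket instead of a global lock */
	struct entry	*head;
};

struct table {
	struct bucket	*buckets;
	size_t		mask;	/* bucket count - 1, bucket count is a power of two */
};

static bool table_init(struct table *t, size_t nbuckets)
{
	/* Masking only selects a valid bucket for power of two sizes */
	if (!nbuckets || (nbuckets & (nbuckets - 1)))
		return false;

	t->buckets = calloc(nbuckets, sizeof(*t->buckets));
	if (!t->buckets)
		return false;

	t->mask = nbuckets - 1;
	for (size_t i = 0; i < nbuckets; i++)
		pthread_mutex_init(&t->buckets[i].lock, NULL);
	return true;
}

/* Illustrative mixing only, the kernel uses its own hash functions */
static struct bucket *table_bucket(struct table *t, const void *owner, unsigned int id)
{
	uint32_t h = ((uint32_t)(uintptr_t)owner ^ id) * 0x9E3779B1u;

	h ^= h >> 16;
	return &t->buckets[h & t->mask];
}

/* Returns false if an entry with the same owner and ID is already hashed */
static bool table_insert(struct table *t, struct entry *e, const void *owner, unsigned int id)
{
	struct bucket *b = table_bucket(t, owner, id);
	struct entry *pos;
	bool added = false;

	pthread_mutex_lock(&b->lock);
	/* Validate under the bucket lock so two racing inserts cannot both win */
	for (pos = b->head; pos; pos = pos->next) {
		if (pos->owner == owner && pos->id == id)
			break;
	}
	if (!pos) {
		e->owner = owner;
		e->id = id;
		e->next = b->head;
		b->head = e;
		added = true;
	}
	pthread_mutex_unlock(&b->lock);
	return added;
}
```

Inserts which hash to different buckets never touch the same lock, so
only inserts landing in the same bucket serialize against each other.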

Experiments with artificial benchmarks have shown that a scaled hash with
per bucket locks comes pretty close to the xarray performance and in some
scenarios it performs better.

Test 1:

     A single process creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         23 ms       23 ms      9 ms     8 ms
getoverrun     14 ms       14 ms      5 ms     4 ms

Test 2:

     A single process creates 50000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         98 ms      219 ms     20 ms    18 ms
getoverrun     62 ms       62 ms     10 ms     9 ms

Test 3:

     A single process creates 100000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create        313 ms      750 ms     48 ms    33 ms
getoverrun    261 ms      260 ms     20 ms    14 ms

Eric's changes create quite some overhead in the create() path due to the
double list walk, as the main issue according to perf is the list walk
itself. With 100k timers each hash bucket contains ~200 timers, which in
the worst case all need to be inspected. The same problem applies for
getoverrun() where the lookup has to walk through the hash buckets to find
the timer it is looking for.

The scaled hash obviously reduces hash collisions and lock contention
significantly. This becomes more prominent with concurrency.
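
The chain length numbers can be reproduced with simple arithmetic. The
sketch below assumes a perfectly even distribution over the buckets;
roundup_pow2(), scaled_buckets() and avg_chain() are hypothetical helper
names, and the bucket count mirrors the 512 * num_possible_cpus() sizing
in posixtimer_init() of this patch.

```c
/* Same result as the kernel's roundup_pow_of_two() for v >= 1 */
static unsigned long roundup_pow2(unsigned long v)
{
	unsigned long p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Requested bucket count as in posixtimer_init() without CONFIG_BASE_SMALL */
static unsigned long scaled_buckets(unsigned long cpus)
{
	return roundup_pow2(512 * cpus);
}

/* Average chain length, rounded up, assuming an even distribution */
static unsigned long avg_chain(unsigned long timers, unsigned long buckets)
{
	return (timers + buckets - 1) / buckets;
}
```

avg_chain(100000, 512) yields 196 for the old fixed size table, which
matches the ~200 above. On a machine with 64 possible CPUs (an assumed
value, not taken from the test setup) scaled_buckets(64) yields 32768
buckets and the same 100k timers average out to 4 per bucket.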

Test 4:

     A process creates 63 threads and all threads wait on a barrier before
     each instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The threads are pinned on
     separate CPUs to achieve maximum concurrency. The numbers are the
     average times per thread:

            mainline        Eric   newhash   xarray
create     180239 ms    38599 ms    579 ms   813 ms
getoverrun   2645 ms     2642 ms     32 ms     7 ms

Test 5:

     A process forks 63 times and all forks wait on a barrier before each
     instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The processes are pinned on
     separate CPUs to achieve maximum concurrency. The numbers are the
     average times per process:

            mainline        Eric   newhash   xarray
create     157253 ms    40008 ms     83 ms    60 ms
getoverrun   2611 ms     2614 ms     40 ms     4 ms

So clearly the reduction of lock contention with Eric's changes makes a
significant difference for the create() loop, but it does not mitigate the
problem of long list walks, which is clearly visible on the getoverrun()
side because that is purely dominated by the lookup itself. Once the timer
is found, the syscall just reads from the timer structure with no other
locks or code paths involved and returns.

The reason for the difference between the thread and the fork case for the
new hash and the xarray is that both suffer from contention on
sighand::siglock and the xarray suffers additionally from contention on the
xarray lock on insertion.

The only case where the reworked hash slightly outperforms the xarray is a
tight loop which creates and deletes timers.

Test 6:

     A process creates 63 threads and all threads wait on a barrier before
     each instance runs a loop which creates and deletes a timer 100000
     times in a row. The threads are pinned on separate CPUs to achieve
     maximum concurrency. The numbers are the average times per thread:

            mainline        Eric   newhash   xarray
loop         5917 ms     5897 ms   5473 ms  7846 ms

Test 7:

     A process forks 63 times and all forks wait on a barrier before each
     instance runs a loop which creates and deletes a timer 100000
     times in a row. The processes are pinned on separate CPUs to achieve
     maximum concurrency. The numbers are the average times per process:

            mainline        Eric   newhash   xarray
loop         5137 ms     7828 ms    891 ms   872 ms

In both tests there is not much contention on the hash, but the ucount
accounting for the signal and in the thread case the sighand::siglock
contention (plus the xarray locking) contribute dominantly to the overhead.

As the memory consumption of the xarray in the sparse ID case is
significant, the scaled hash with per bucket locks seems to be the better
overall option. While the xarray has faster lookup times for a large number
of timers, the actual syscall usage which requires the lookup is not an
extreme hotpath. Most applications utilize signal delivery and all syscalls
except timer_getoverrun(2) are anything but cheap.
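
The xarray memory consumption for sparse IDs can be estimated with a
simplified model. The sketch below assumes 64 slots per xarray node and
576 bytes per node, which are the usual values for a 64-bit kernel
without CONFIG_BASE_SMALL but are not stated in this change log. It
counts the distinct nodes per tree level and ignores allocator overhead.
xa_nodes_needed() is a hypothetical helper name.

```c
#define SLOTS_SHIFT	6	/* assumed: 64 slots per node */
#define NODE_BYTES	576UL	/* assumed: size of one node on 64-bit */

/* Number of tree nodes needed to store @count IDs spaced @stride apart */
static unsigned long xa_nodes_needed(unsigned long count, unsigned long stride)
{
	unsigned long max_index, nodes = 0;
	unsigned int shift = 0;

	if (!count)
		return 0;
	max_index = (count - 1) * stride;

	/* Walk from the leaf level up until one node covers all indices */
	for (;;) {
		unsigned long prev = ~0UL;

		shift += SLOTS_SHIFT;
		/* IDs are ascending, so a node change shows up as a new prefix */
		for (unsigned long i = 0; i < count; i++) {
			unsigned long node = (i * stride) >> shift;

			if (node != prev) {
				nodes++;
				prev = node;
			}
		}
		if (!(max_index >> shift))
			break;
	}
	return nodes;
}
```

Under these assumptions xa_nodes_needed(2048, 512) yields 2309 nodes or
about 1.3 MB, as every timer ends up in a leaf node of its own. The same
2048 timers with consecutive IDs need 33 nodes, i.e. less than 20 KB.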

So implement a scaled hash with per bucket locks, which offers the best
tradeoff between performance and memory consumption.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V2: Replace hash() by hashbucket(), which returns the bucket pointer.
---
 kernel/time/posix-timers.c |   99 ++++++++++++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 31 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -12,10 +12,10 @@
 #include <linux/compat.h>
 #include <linux/compiler.h>
 #include <linux/hash.h>
-#include <linux/hashtable.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/list.h>
+#include <linux/memblock.h>
 #include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
@@ -40,8 +40,18 @@ static struct kmem_cache *posix_timers_c
  * This allows checkpoint/restore to reconstruct the exact timer IDs for
  * a process.
  */
-static DEFINE_HASHTABLE(posix_timers_hashtable, 9);
-static DEFINE_SPINLOCK(hash_lock);
+struct timer_hash_bucket {
+	spinlock_t		lock;
+	struct hlist_head	head;
+};
+
+static struct {
+	struct timer_hash_bucket	*buckets;
+	unsigned long			bits;
+} __timer_data __ro_after_init __aligned(2*sizeof(long));
+
+#define timer_buckets	(__timer_data.buckets)
+#define timer_hashbits	(__timer_data.bits)
 
 static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
@@ -75,18 +85,18 @@ static inline void unlock_timer(struct k
 DEFINE_CLASS(lock_timer, struct k_itimer *, unlock_timer(_T), __lock_timer(id), timer_t id);
 DEFINE_CLASS_IS_COND_GUARD(lock_timer);
 
-static int hash(struct signal_struct *sig, unsigned int nr)
+static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig, unsigned int nr)
 {
-	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
+	return &timer_buckets[hash_32(hash32_ptr(sig) ^ nr, timer_hashbits)];
 }
 
 static struct k_itimer *posix_timer_by_id(timer_t id)
 {
 	struct signal_struct *sig = current->signal;
-	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash) {
+	hlist_for_each_entry_rcu(timer, &bucket->head, t_hash) {
 		/* timer->it_signal can be set concurrently */
 		if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
 			return timer;
@@ -105,11 +115,13 @@ static inline struct signal_struct *posi
 	return (struct signal_struct *)(val & ~1UL);
 }
 
-static bool posix_timer_hashed(struct hlist_head *head, struct signal_struct *sig, timer_t id)
+static bool posix_timer_hashed(struct timer_hash_bucket *bucket, struct signal_struct *sig,
+			       timer_t id)
 {
+	struct hlist_head *head = &bucket->head;
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&bucket->lock)) {
 		if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
 			return true;
 	}
@@ -120,34 +132,34 @@ static int posix_timer_add(struct k_itim
 {
 	struct signal_struct *sig = current->signal;
 
-	/*
-	 * FIXME: Replace this by a per signal struct xarray once there is
-	 * a plan to handle the resulting CRIU regression gracefully.
-	 */
 	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
 		/* Get the next timer ID and clamp it to positive space */
 		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
-		struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 
-		spin_lock(&hash_lock);
-		if (!posix_timer_hashed(head, sig, id)) {
+		scoped_guard (spinlock, &bucket->lock) {
 			/*
-			 * Set the timer ID and the signal pointer to make
-			 * it identifiable in the hash table. The signal
-			 * pointer has bit 0 set to indicate that it is not
-			 * yet fully initialized. posix_timer_hashed()
-			 * masks this bit out, but the syscall lookup fails
-			 * to match due to it being set. This guarantees
-			 * that there can't be duplicate timer IDs handed
-			 * out.
+			 * Validate under the lock as this could have raced
+			 * against another thread ending up with the same
+			 * ID, which is highly unlikely, but possible.
 			 */
-			timer->it_id = (timer_t)id;
-			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-			hlist_add_head_rcu(&timer->t_hash, head);
-			spin_unlock(&hash_lock);
-			return id;
+			if (!posix_timer_hashed(bucket, sig, id)) {
+				/*
+				 * Set the timer ID and the signal pointer to make
+				 * it identifiable in the hash table. The signal
+				 * pointer has bit 0 set to indicate that it is not
+				 * yet fully initialized. posix_timer_hashed()
+				 * masks this bit out, but the syscall lookup fails
+				 * to match due to it being set. This guarantees
+				 * that there can't be duplicate timer IDs handed
+				 * out.
+				 */
+				timer->it_id = (timer_t)id;
+				timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
+				hlist_add_head_rcu(&timer->t_hash, &bucket->head);
+				return id;
+			}
 		}
-		spin_unlock(&hash_lock);
 		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
@@ -405,7 +417,9 @@ void posixtimer_free_timer(struct k_itim
 
 static void posix_timer_unhash_and_free(struct k_itimer *tmr)
 {
-	scoped_guard (spinlock, &hash_lock)
+	struct timer_hash_bucket *bucket = hash_bucket(posix_sig_owner(tmr), tmr->it_id);
+
+	scoped_guard (spinlock, &bucket->lock)
 		hlist_del_rcu(&tmr->t_hash);
 	posixtimer_putref(tmr);
 }
@@ -1486,3 +1500,26 @@ static const struct k_clock *clockid_to_
 
 	return posix_clocks[array_index_nospec(idx, ARRAY_SIZE(posix_clocks))];
 }
+
+static int __init posixtimer_init(void)
+{
+	unsigned long i, size;
+	unsigned int shift;
+
+	if (IS_ENABLED(CONFIG_BASE_SMALL))
+		size = 512;
+	else
+		size = roundup_pow_of_two(512 * num_possible_cpus());
+
+	timer_buckets = alloc_large_system_hash("posixtimers", sizeof(*timer_buckets),
+						size, 0, 0, &shift, NULL, size, size);
+	size = 1UL << shift;
+	timer_hashbits = ilog2(size);
+
+	for (i = 0; i < size; i++) {
+		spin_lock_init(&timer_buckets[i].lock);
+		INIT_HLIST_HEAD(&timer_buckets[i].head);
+	}
+	return 0;
+}
+core_initcall(posixtimer_init);


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 13/18] posix-timers: Switch to jhash32()
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (11 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 12/18] posix-timers: Improve hash table performance Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

The hash distribution of hash_32() is suboptimal. The Jenkins hash, used
here via jhash2(), provides a much better distribution, which evens out
the length of the hash bucket lists, which in turn avoids large outliers
in list walk times.

Due to the sparse ID space (thanks CRIU) there is no guarantee that the
timers will be fully evenly distributed over the hash buckets, but the
behaviour is way better than with hash_32() even for randomly sparse ID
spaces.
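
How well a hash copes with a given ID pattern can be checked with a small
userspace harness. The sketch below is a replica of the hash_32()
multiply and shift scheme and reports the longest bucket chain for IDs
which are a fixed stride apart. max_chain_hash32() is a hypothetical
helper, sig_hash stands in for hash32_ptr(sig), and jhash2() is not
reproduced here, so a comparison requires plugging in a second hash.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define GOLDEN_RATIO_32	0x61C88647u

/* Userspace replica of hash_32(): multiply and keep the top @bits bits */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

/*
 * Hash @count IDs spaced @stride apart into 2^@bits buckets and return
 * the length of the longest chain. Returns 0 on invalid input or on
 * allocation failure.
 */
static unsigned int max_chain_hash32(uint32_t sig_hash, uint32_t stride,
				     unsigned int count, unsigned int bits)
{
	unsigned int *load, max = 0;

	if (bits < 1 || bits > 24)
		return 0;

	load = calloc((size_t)1 << bits, sizeof(*load));
	if (!load)
		return 0;

	for (unsigned int i = 0; i < count; i++) {
		unsigned int *slot = &load[hash_32(sig_hash ^ (i * stride), bits)];

		if (++(*slot) > max)
			max = *slot;
	}
	free(load);
	return max;
}
```

The closer the result is to count / 2^bits, the more even the
distribution is for that particular ID pattern.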

For a pathological test case with 64 processes creating and accessing
20000 timers each, this results in a runtime reduction of ~10% and a
significantly reduced runtime variation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V2: New patch
---
 kernel/time/posix-timers.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -11,8 +11,8 @@
  */
 #include <linux/compat.h>
 #include <linux/compiler.h>
-#include <linux/hash.h>
 #include <linux/init.h>
+#include <linux/jhash.h>
 #include <linux/interrupt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
@@ -47,11 +47,11 @@ struct timer_hash_bucket {
 
 static struct {
 	struct timer_hash_bucket	*buckets;
-	unsigned long			bits;
+	unsigned long			mask;
 } __timer_data __ro_after_init __aligned(2*sizeof(long));
 
 #define timer_buckets	(__timer_data.buckets)
-#define timer_hashbits	(__timer_data.bits)
+#define timer_hashmask	(__timer_data.mask)
 
 static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
@@ -87,7 +87,7 @@ DEFINE_CLASS_IS_COND_GUARD(lock_timer);
 
 static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig, unsigned int nr)
 {
-	return &timer_buckets[hash_32(hash32_ptr(sig) ^ nr, timer_hashbits)];
+	return &timer_buckets[jhash2((u32 *)&sig, sizeof(sig) / sizeof(u32), nr) & timer_hashmask];
 }
 
 static struct k_itimer *posix_timer_by_id(timer_t id)
@@ -1514,7 +1514,7 @@ static int __init posixtimer_init(void)
 	timer_buckets = alloc_large_system_hash("posixtimers", sizeof(*timer_buckets),
 						size, 0, 0, &shift, NULL, size, size);
 	size = 1UL << shift;
-	timer_hashbits = ilog2(size);
+	timer_hashmask = size - 1;
 
 	for (i = 0; i < size; i++) {
 		spin_lock_init(&timer_buckets[i].lock);


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 14/18] posix-timers: Avoid false cacheline sharing
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (12 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 13/18] posix-timers: Switch to jhash32() Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-11 13:53   ` Frederic Weisbecker
                     ` (3 more replies)
  2025-03-08 16:48 ` [patch V3 15/18] posix-timers: Make per process list RCU safe Thomas Gleixner
                   ` (3 subsequent siblings)
  17 siblings, 4 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

struct k_itimer has the hlist_node, which is used for lookup in the hash
bucket, and the timer lock in the same cache line.

That's obviously bad, if one CPU fiddles with a timer and the other is
walking the hash bucket on which that timer is queued.

Avoid this by restructuring struct k_itimer, so that the read mostly (only
modified during setup and teardown) fields are in the first cache line and
the lock and the rest of the fields which get written to are in cacheline
2-N.
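
The layout principle can be sketched in plain C11. struct demo_timer and
demo_timer_alloc() are hypothetical and only mimic the split, they are
not struct k_itimer. The 64 byte cache line size is an assumption, the
kernel uses the architecture specific value.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHELINE	64	/* assumed cache line size */

struct demo_timer {
	/* 1st cache line: read-mostly, written only at setup and teardown */
	alignas(CACHELINE) void	*hash_next;
	int			id;
	int			clock;
	const void		*owner;

	/* 2nd cache line and above: written while the timer is in use */
	alignas(CACHELINE) int	lock;
	int			status;
	long long		overrun;
};

/* The lock must not share a cache line with the hash linkage */
_Static_assert(offsetof(struct demo_timer, lock) == CACHELINE,
	       "lock is not at the start of the second cache line");

/*
 * The allocation has to honour the alignment as well, which is what
 * passing __alignof__() to kmem_cache_create() achieves in the patch.
 */
static struct demo_timer *demo_timer_alloc(void)
{
	return aligned_alloc(alignof(struct demo_timer), sizeof(struct demo_timer));
}
```

A CPU walking a bucket then only reads the first cache line, while a CPU
taking the lock dirties the second one.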

This reduces cacheline contention in a test case of 64 processes creating and
accessing 20000 timers each by almost 30% according to perf.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V2: New patch
---
 include/linux/posix-timers.h |   21 ++++++++++++---------
 kernel/time/posix-timers.c   |    4 ++--
 2 files changed, 14 insertions(+), 11 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -177,23 +177,26 @@ static inline void posix_cputimers_init_
  * @rcu:		RCU head for freeing the timer.
  */
 struct k_itimer {
-	struct hlist_node	list;
-	struct hlist_node	ignored_list;
+	/* 1st cacheline contains read-mostly fields */
 	struct hlist_node	t_hash;
-	spinlock_t		it_lock;
-	const struct k_clock	*kclock;
-	clockid_t		it_clock;
+	struct hlist_node	list;
 	timer_t			it_id;
+	clockid_t		it_clock;
+	int			it_sigev_notify;
+	enum pid_type		it_pid_type;
+	struct signal_struct	*it_signal;
+	const struct k_clock	*kclock;
+
+	/* 2nd cacheline and above contain fields which are modified regularly */
+	spinlock_t		it_lock;
 	int			it_status;
 	bool			it_sig_periodic;
 	s64			it_overrun;
 	s64			it_overrun_last;
 	unsigned int		it_signal_seq;
 	unsigned int		it_sigqueue_seq;
-	int			it_sigev_notify;
-	enum pid_type		it_pid_type;
 	ktime_t			it_interval;
-	struct signal_struct	*it_signal;
+	struct hlist_node	ignored_list;
 	union {
 		struct pid		*it_pid;
 		struct task_struct	*it_process;
@@ -210,7 +213,7 @@ struct k_itimer {
 		} alarm;
 	} it;
 	struct rcu_head		rcu;
-};
+} ____cacheline_aligned_in_smp;
 
 void run_posix_cpu_timers(void);
 void posix_cpu_timers_exit(struct task_struct *task);
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -260,8 +260,8 @@ static int posix_get_hrtimer_res(clockid
 
 static __init int init_posix_timers(void)
 {
-	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer), 0,
-					       SLAB_ACCOUNT, NULL);
+	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer),
+					       __alignof__(struct k_itimer), SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 15/18] posix-timers: Make per process list RCU safe
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (13 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-11 15:29   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held Thomas Gleixner
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Preparatory change to remove the sighand locking from the /proc/$PID/timers
iterator.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/time/posix-timers.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -518,7 +518,7 @@ static int do_timer_create(clockid_t whi
 		 * Store the unmodified signal pointer to make it valid.
 		 */
 		WRITE_ONCE(new_timer->it_signal, current->signal);
-		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
+		hlist_add_head_rcu(&new_timer->list, &current->signal->posix_timers);
 	}
 	/*
 	 * After unlocking @new_timer is subject to concurrent removal and
@@ -1004,7 +1004,7 @@ static void posix_timer_delete(struct k_
 		unsigned long sig = (unsigned long)timer->it_signal | 1UL;
 
 		WRITE_ONCE(timer->it_signal, (struct signal_struct *)sig);
-		hlist_del(&timer->list);
+		hlist_del_rcu(&timer->list);
 		posix_timer_cleanup_ignored(timer);
 	}
 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (14 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 15/18] posix-timers: Make per process list RCU safe Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-08 22:38   ` Cyrill Gorcunov
                     ` (2 more replies)
  2025-03-08 16:48 ` [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID Thomas Gleixner
  2025-03-08 16:48 ` [patch V3 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode Thomas Gleixner
  17 siblings, 3 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

The readout of /proc/$PID/timers holds sighand::siglock with interrupts
disabled. That is required to protect against concurrent modifications of
the task::signal::posix_timers list because the list is not RCU safe.

With the conversion of the timer storage to an RCU protected hlist, this is
no longer required.

The only requirement is to protect the returned entry against a concurrent
free, which is trivial as the timers are RCU protected.

Removing the trylock of sighand::siglock is benign because the life time of
task_struct::signal is bound to the life time of the task_struct itself.

There are two scenarios where this matters:

  1) The process is alive and not about to be checkpointed

  2) The process is stopped via ptrace for checkpointing

#1 is a racy snapshot of the armed timers and nothing can rely on it. It's
   nothing more than debug information and it has been that way before
   because the sighand lock is dropped when the buffer is full and the
   restart of the iteration might find a completely different set of
   timers.

   The task and therefore task::signal cannot be freed as timers_start()
   acquired a reference count via get_pid_task().

#2 the process is stopped for checkpointing so nothing can delete or create
   timers at this point. Neither can the process exit during the traversal.

   If CRIU fails to observe an exit in progress prior to the dissemination
   of the timers, then there are more severe problems to solve in the CRIU
   mechanics as they can't rely on posix timers being enabled in the first
   place.

Therefore replace the lock acquisition with rcu_read_lock() and switch the
timer storage traversal over to seq_hlist_*_rcu().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 fs/proc/base.c |   48 ++++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 28 deletions(-)

--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2497,11 +2497,9 @@ static const struct file_operations proc
 
 #if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_POSIX_TIMERS)
 struct timers_private {
-	struct pid *pid;
-	struct task_struct *task;
-	struct sighand_struct *sighand;
-	struct pid_namespace *ns;
-	unsigned long flags;
+	struct pid		*pid;
+	struct task_struct	*task;
+	struct pid_namespace	*ns;
 };
 
 static void *timers_start(struct seq_file *m, loff_t *pos)
@@ -2512,54 +2510,48 @@ static void *timers_start(struct seq_fil
 	if (!tp->task)
 		return ERR_PTR(-ESRCH);
 
-	tp->sighand = lock_task_sighand(tp->task, &tp->flags);
-	if (!tp->sighand)
-		return ERR_PTR(-ESRCH);
-
-	return seq_hlist_start(&tp->task->signal->posix_timers, *pos);
+	rcu_read_lock();
+	return seq_hlist_start_rcu(&tp->task->signal->posix_timers, *pos);
 }
 
 static void *timers_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct timers_private *tp = m->private;
-	return seq_hlist_next(v, &tp->task->signal->posix_timers, pos);
+
+	return seq_hlist_next_rcu(v, &tp->task->signal->posix_timers, pos);
 }
 
 static void timers_stop(struct seq_file *m, void *v)
 {
 	struct timers_private *tp = m->private;
 
-	if (tp->sighand) {
-		unlock_task_sighand(tp->task, &tp->flags);
-		tp->sighand = NULL;
-	}
-
 	if (tp->task) {
 		put_task_struct(tp->task);
 		tp->task = NULL;
+		rcu_read_unlock();
 	}
 }
 
 static int show_timer(struct seq_file *m, void *v)
 {
-	struct k_itimer *timer;
-	struct timers_private *tp = m->private;
-	int notify;
 	static const char * const nstr[] = {
-		[SIGEV_SIGNAL] = "signal",
-		[SIGEV_NONE] = "none",
-		[SIGEV_THREAD] = "thread",
+		[SIGEV_SIGNAL]	= "signal",
+		[SIGEV_NONE]	= "none",
+		[SIGEV_THREAD]	= "thread",
 	};
 
-	timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list);
-	notify = timer->it_sigev_notify;
+	struct k_itimer *timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list);
+	struct timers_private *tp = m->private;
+	int notify = timer->it_sigev_notify;
+
+	guard(spinlock_irq)(&timer->it_lock);
+	if (!posixtimer_valid(timer))
+		return 0;
 
 	seq_printf(m, "ID: %d\n", timer->it_id);
-	seq_printf(m, "signal: %d/%px\n",
-		   timer->sigq.info.si_signo,
+	seq_printf(m, "signal: %d/%px\n", timer->sigq.info.si_signo,
 		   timer->sigq.info.si_value.sival_ptr);
-	seq_printf(m, "notify: %s/%s.%d\n",
-		   nstr[notify & ~SIGEV_THREAD_ID],
+	seq_printf(m, "notify: %s/%s.%d\n", nstr[notify & ~SIGEV_THREAD_ID],
 		   (notify & SIGEV_THREAD_ID) ? "tid" : "pid",
 		   pid_nr_ns(timer->it_pid, tp->ns));
 	seq_printf(m, "ClockID: %d\n", timer->it_clock);


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (15 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-08 22:25   ` Cyrill Gorcunov
  2025-03-11 21:35   ` Frederic Weisbecker
  2025-03-08 16:48 ` [patch V3 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode Thomas Gleixner
  17 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonically increasing timer ID provided by this syscall. It creates
and deletes timers until the desired ID is reached. This can loop for a
long time, when the checkpointed process had a very sparse timer ID range.

It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tedious due to the 32/64bit compat
issues of sigevent_t and of dubious value.

The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.

This allows addressing the issue with a prctl() so that the restorer
thread can do:

   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
      goto linear_mode;
   create_timers_with_explicit_ids();
   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);

This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.

Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.

If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.

There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but contains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out.

As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.
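
A minimal userspace sketch of the restore sequence could look like the
following. create_timer_at() is a hypothetical helper and not CRIU code.
The PR_* fallback values are copied from the uapi hunk of this patch, and
the use of SIGEV_NONE and CLOCK_MONOTONIC is an arbitrary choice for the
example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Values from the uapi hunk of this patch, older headers lack them */
#ifndef PR_TIMER_CREATE_RESTORE_IDS
# define PR_TIMER_CREATE_RESTORE_IDS		77
# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
#endif

/*
 * Try to create a timer with the exact ID @want.
 *
 * Returns 1 on success, 0 if the prctl() is not available so the caller
 * has to fall back to the linear create/delete method, and -1 if
 * timer_create() failed, e.g. because the ID is already in use.
 */
static int create_timer_at(int want)
{
	struct sigevent sev;
	int id = want;	/* The kernel side timer_t is an int */
	int ret = -1;

	/* The unused prctl() arguments must be zero */
	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0))
		return 0;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_NONE;

	/* Raw syscall: in restore mode the kernel reads the requested ID from &id */
	if (!syscall(SYS_timer_create, CLOCK_MONOTONIC, &sev, &id) && id == want)
		ret = 1;

	/* Switch back to normal mode in any case */
	prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0);
	return ret;
}
```

The helper passes explicit zeros for the unused prctl() arguments because
the kernel side rejects non-zero values in arg3 to arg5 with -EINVAL.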

Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V2: Move the ID counter ahead to avoid collisions after switching back to
    normal mode.
---
 include/linux/posix-timers.h |    2 
 include/linux/sched/signal.h |    1 
 include/uapi/linux/prctl.h   |   10 ++++
 kernel/sys.c                 |    5 ++
 kernel/time/posix-timers.c   |   97 +++++++++++++++++++++++++++++++------------
 5 files changed, 89 insertions(+), 26 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -114,6 +114,7 @@ bool posixtimer_init_sigqueue(struct sig
 void posixtimer_send_sigqueue(struct k_itimer *tmr);
 bool posixtimer_deliver_signal(struct kernel_siginfo *info, struct sigqueue *timer_sigq);
 void posixtimer_free_timer(struct k_itimer *timer);
+long posixtimer_create_prctl(unsigned long ctrl);
 
 /* Init task static initializer */
 #define INIT_CPU_TIMERBASE(b) {						\
@@ -140,6 +141,7 @@ static inline void posixtimer_rearm_itim
 static inline bool posixtimer_deliver_signal(struct kernel_siginfo *info,
 					     struct sigqueue *timer_sigq) { return false; }
 static inline void posixtimer_free_timer(struct k_itimer *timer) { }
+static inline long posixtimer_create_prctl(unsigned long ctrl) { return -EINVAL; }
 #endif
 
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,6 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
+	unsigned int		timer_create_restore_ids:1;
 	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,14 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS      76
 
+/*
+ * Controls the mode of timer_create() for CRIU restore operations.
+ * Enabling this allows CRIU to restore timers with explicit IDs.
+ *
+ * Don't use for normal operations as the result might be undefined.
+ */
+#define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2811,6 +2811,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			return -EINVAL;
 		error = arch_lock_shadow_stack_status(me, arg2);
 		break;
+	case PR_TIMER_CREATE_RESTORE_IDS:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = posixtimer_create_prctl(arg2);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
+#include <linux/prctl.h>
 #include <linux/sched/task.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
@@ -57,6 +58,8 @@ static const struct k_clock * const posi
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
 static const struct k_clock clock_realtime, clock_monotonic;
 
+#define TIMER_ANY_ID		INT_MIN
+
 /* SIGEV_THREAD_ID cannot share a bit with the other SIGEV values. */
 #if SIGEV_THREAD_ID != (SIGEV_THREAD_ID & \
 			~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))
@@ -128,38 +131,60 @@ static bool posix_timer_hashed(struct ti
 	return false;
 }
 
-static int posix_timer_add(struct k_itimer *timer)
+static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)
+{
+	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
+
+	scoped_guard (spinlock, &bucket->lock) {
+		/*
+		 * Validate under the lock as this could have raced against
+		 * another thread ending up with the same ID, which is
+		 * highly unlikely, but possible.
+		 */
+		if (!posix_timer_hashed(bucket, sig, id)) {
+			/*
+			 * Set the timer ID and the signal pointer to make
+			 * it identifiable in the hash table. The signal
+			 * pointer has bit 0 set to indicate that it is not
+			 * yet fully initialized. posix_timer_hashed()
+			 * masks this bit out, but the syscall lookup fails
+			 * to match due to it being set. This guarantees
+			 * that there can't be duplicate timer IDs handed
+			 * out.
+			 */
+			timer->it_id = (timer_t)id;
+			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
+			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
+			return true;
+		}
+	}
+	return false;
+}
+
+static int posix_timer_add(struct k_itimer *timer, int req_id)
 {
 	struct signal_struct *sig = current->signal;
 
+	if (unlikely(req_id != TIMER_ANY_ID)) {
+		if (!posix_timer_add_at(timer, sig, req_id))
+			return -EBUSY;
+
+		/*
+		 * Move the ID counter past the requested ID, so that after
+		 * switching back to normal mode the IDs are outside of the
+		 * exact allocated region. That avoids ID collisions on the
+		 * next regular timer_create() invocations.
+		 */
+		atomic_set(&sig->next_posix_timer_id, req_id + 1);
+		return req_id;
+	}
+
 	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
 		/* Get the next timer ID and clamp it to positive space */
 		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
-		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 
-		scoped_guard (spinlock, &bucket->lock) {
-			/*
-			 * Validate under the lock as this could have raced
-			 * against another thread ending up with the same
-			 * ID, which is highly unlikely, but possible.
-			 */
-			if (!posix_timer_hashed(bucket, sig, id)) {
-				/*
-				 * Set the timer ID and the signal pointer to make
-				 * it identifiable in the hash table. The signal
-				 * pointer has bit 0 set to indicate that it is not
-				 * yet fully initialized. posix_timer_hashed()
-				 * masks this bit out, but the syscall lookup fails
-				 * to match due to it being set. This guarantees
-				 * that there can't be duplicate timer IDs handed
-				 * out.
-				 */
-				timer->it_id = (timer_t)id;
-				timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-				hlist_add_head_rcu(&timer->t_hash, &bucket->head);
-				return id;
-			}
-		}
+		if (posix_timer_add_at(timer, sig, id))
+			return id;
 		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
@@ -364,6 +389,16 @@ static enum hrtimer_restart posix_timer_
 	return HRTIMER_NORESTART;
 }
 
+long posixtimer_create_prctl(unsigned long ctrl)
+{
+	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
+		return -EINVAL;
+
+	guard(spinlock_irq)(&current->sighand->siglock);
+	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;
+	return 0;
+}
+
 static struct pid *good_sigevent(sigevent_t * event)
 {
 	struct pid *pid = task_tgid(current);
@@ -435,6 +470,7 @@ static int do_timer_create(clockid_t whi
 			   timer_t __user *created_timer_id)
 {
 	const struct k_clock *kc = clockid_to_kclock(which_clock);
+	timer_t req_id = TIMER_ANY_ID;
 	struct k_itimer *new_timer;
 	int error, new_timer_id;
 
@@ -449,11 +485,20 @@ static int do_timer_create(clockid_t whi
 
 	spin_lock_init(&new_timer->it_lock);
 
+	/* Special case for CRIU to restore timers with a given timer ID. */
+	if (unlikely(current->signal->timer_create_restore_ids)) {
+		if (copy_from_user(&req_id, created_timer_id, sizeof(req_id)))
+			return -EFAULT;
+		/* Valid IDs are 0..INT_MAX */
+		if ((unsigned int)req_id > INT_MAX)
+			return -EINVAL;
+	}
+
 	/*
 	 * Add the timer to the hash table. The timer is not yet valid
 	 * after insertion, but has a unique ID allocated.
 	 */
-	new_timer_id = posix_timer_add(new_timer);
+	new_timer_id = posix_timer_add(new_timer, req_id);
 	if (new_timer_id < 0) {
 		posixtimer_free_timer(new_timer);
 		return new_timer_id;


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode
  2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
                   ` (16 preceding siblings ...)
  2025-03-08 16:48 ` [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID Thomas Gleixner
@ 2025-03-08 16:48 ` Thomas Gleixner
  2025-03-10  8:11   ` [patch V3a " Thomas Gleixner
  17 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-08 16:48 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

The exact timer ID allocation mode is used by CRIU to restore timers with a
given ID. Add a test case for it.

It's skipped on older kernels when the prctl() fails.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
V3: Use the PRCTL defines
V2: Adapt to the ID counter change in the exact mode case
---
 tools/testing/selftests/timers/posix_timers.c |   66 +++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -7,6 +7,7 @@
  * Kernel loop code stolen from Steven Rostedt <srostedt@redhat.com>
  */
 #define _GNU_SOURCE
+#include <sys/prctl.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <stdio.h>
@@ -599,14 +600,77 @@ static void check_overrun(int which, con
 			 "check_overrun %s\n", name);
 }
 
+#include <sys/syscall.h>
+
+static int do_timer_create(int *id)
+{
+	return syscall(__NR_timer_create, CLOCK_MONOTONIC, NULL, id);
+}
+
+static int do_timer_delete(int id)
+{
+	return syscall(__NR_timer_delete, id);
+}
+
+#ifndef define PR_TIMER_CREATE_RESTORE_IDS
+# define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	 0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		 1
+#endif
+
+static void check_timer_create_exact(void)
+{
+	int id;
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0)) {
+		switch (errno) {
+		case EINVAL:
+			ksft_test_result_skip("check timer create exact, not supported\n");
+			return;
+		default:
+			ksft_test_result_skip("check timer create exact, errno = %d\n", errno);
+			return;
+		}
+	}
+
+	id = 8;
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0))
+		fatal_error(NULL, "prctl()");
+
+	if (id != 8) {
+		ksft_test_result_fail("check timer create exact %d != 8\n", id);
+		return;
+	}
+
+	/* Validate that it went back to normal mode and allocates ID 9 */
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (id == 9)
+		ksft_test_result_pass("check timer create exact\n");
+	else
+		ksft_test_result_fail("check timer create exact. Disabling failed.\n");
+}
+
 int main(int argc, char **argv)
 {
 	ksft_print_header();
-	ksft_set_plan(18);
+	ksft_set_plan(19);
 
 	ksft_print_msg("Testing posix timers. False negative may happen on CPU execution \n");
 	ksft_print_msg("based timers if other threads run on the CPU...\n");
 
+	check_timer_create_exact();
+
 	check_itimer(ITIMER_VIRTUAL, "ITIMER_VIRTUAL");
 	check_itimer(ITIMER_PROF, "ITIMER_PROF");
 	check_itimer(ITIMER_REAL, "ITIMER_REAL");


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible
  2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
@ 2025-03-08 21:39   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-08 21:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Sat, Mar 08, 2025 at 05:48:10PM +0100, Thomas Gleixner wrote:
> Frederic pointed out that the memory operations to initialize the timer are
> not guaranteed to be visible, when __lock_timer() observes timer::it_signal
> valid under timer::it_lock:
> 
>   T0                                      T1
>   ---------                               -----------
>   do_timer_create()
>       // A
>       new_timer->.... = ....
>       spin_lock(current->sighand)
>       // B
>       WRITE_ONCE(new_timer->it_signal, current->signal)
>       spin_unlock(current->sighand)
> 					sys_timer_*()
> 					   t =  __lock_timer()
> 						  spin_lock(&timr->it_lock)
> 						  // observes B
> 						  if (timr->it_signal == current->signal)
> 						    return timr;
> 			                   if (!t)
> 					       return;
> 					// Is not guaranteed to observe A
> 
> Protect the write of timer::it_signal, which makes the timer valid, with
> timer::it_lock as well. This guarantees that T1 must observe the
> initialization A completely, when it observes the valid signal pointer
> under timer::it_lock. sighand::siglock must still be taken to protect the
> signal::posix_timers list.
> 
> Reported-by: Frederic Weisbecker <frederic@kernel.org>
> Suggested-by: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-08 16:48 ` [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID Thomas Gleixner
@ 2025-03-08 22:25   ` Cyrill Gorcunov
  2025-03-11 21:35   ` Frederic Weisbecker
  1 sibling, 0 replies; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-03-08 22:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Sat, Mar 08, 2025 at 05:48:47PM +0100, Thomas Gleixner wrote:
> Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers
> with the same timer ID on restore. It uses sys_timer_create() and relies on
> the monotonically increasing timer ID provided by this syscall. It creates
> and deletes timers until the desired ID is reached. This can loop for a long
> time when the checkpointed process had a very sparse timer ID range.
...
(I've rerun the tests with the new series)

Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
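
For illustration, the cost difference between the legacy create/delete loop
quoted above and the exact allocation mode can be modelled in plain userspace
C. This is only a toy model under stated assumptions, not the kernel code:
struct id_model, MODEL_SLOTS and all model_*() helpers are hypothetical names,
and the small fixed-size table stands in for the real hash table and the
0..INT_MAX ID space.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SLOTS 64	/* hypothetical: tiny ID space instead of 0..INT_MAX */

struct id_model {
	unsigned int	next_id;	/* models signal_struct::next_posix_timer_id */
	bool		restore_ids;	/* models the PR_TIMER_CREATE_RESTORE_IDS mode */
	bool		used[MODEL_SLOTS];
};

/* Returns the allocated ID, or -1 if no ID could be allocated */
static int model_timer_create(struct id_model *m, int req_id)
{
	if (m->restore_ids) {
		/* Exact mode: take the requested ID or fail if it is busy */
		if (req_id < 0 || req_id >= MODEL_SLOTS || m->used[req_id])
			return -1;
		m->used[req_id] = true;
		/* Move the counter past the requested ID, as patch 17 does */
		m->next_id = (unsigned int)req_id + 1;
		return req_id;
	}

	/* Normal mode: req_id is ignored, search from the counter onwards */
	for (unsigned int cnt = 0; cnt < MODEL_SLOTS; cnt++) {
		unsigned int id = m->next_id++ % MODEL_SLOTS;

		if (!m->used[id]) {
			m->used[id] = true;
			return (int)id;
		}
	}
	return -1;
}

static void model_timer_delete(struct id_model *m, int id)
{
	assert(id >= 0 && id < MODEL_SLOTS && m->used[id]);
	m->used[id] = false;
}

/*
 * Legacy restore: create and delete timers until the wanted ID shows up.
 * Returns the number of create calls needed, or -1 on failure.
 */
static int model_legacy_restore(struct id_model *m, int want)
{
	int calls = 0;

	for (;;) {
		int id = model_timer_create(m, 0);

		calls++;
		if (id == want)
			return calls;
		if (id < 0 || id > want)
			return -1;
		model_timer_delete(m, id);
	}
}
```

In this model, restoring ID 8 in a fresh process takes nine create calls the
legacy way and a single call in exact mode. After switching exact mode off,
regular allocation continues at 9, which matches what the selftest in patch 18
checks.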

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
  2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held Thomas Gleixner
@ 2025-03-08 22:38   ` Cyrill Gorcunov
  2025-03-11 15:26   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-03-08 22:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Sat, Mar 08, 2025 at 05:48:45PM +0100, Thomas Gleixner wrote:
...
>  
>  static int show_timer(struct seq_file *m, void *v)
>  {
> -	struct k_itimer *timer;
> -	struct timers_private *tp = m->private;
> -	int notify;
>  	static const char * const nstr[] = {
> +		[SIGEV_SIGNAL]	= "signal",
> +		[SIGEV_NONE]	= "none",
> +		[SIGEV_THREAD]	= "thread",
>  	};
>  
...
> -	seq_printf(m, "notify: %s/%s.%d\n",
> -		   nstr[notify & ~SIGEV_THREAD_ID],
> +	seq_printf(m, "notify: %s/%s.%d\n", nstr[notify & ~SIGEV_THREAD_ID],
>  		   (notify & SIGEV_THREAD_ID) ? "tid" : "pid",
>  		   pid_nr_ns(timer->it_pid, tp->ns));
...

Btw this nstr[notify & ~SIGEV_THREAD_ID] has always been fishy since ~SIGEV_THREAD_ID
doesn't give a proper mask over the nstr size :-) It just happens to work, but if for some
reason ::it_sigev_notify gets corrupted we will get a surprise. I think later (not in this
series) we should provide an explicit bitwise mask here.

	Cyrill
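
A minimal userspace sketch of what such an explicit mask could look like,
assuming the table is sized to the mask so that no index can fall outside of
it. All DEMO_* names and demo_notify_name() are hypothetical; the numeric
values mirror the Linux SIGEV_* constants but are redefined here so that the
example stands alone. This is not a proposed kernel patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins mirroring the Linux SIGEV_* values */
#define DEMO_SIGEV_SIGNAL	0
#define DEMO_SIGEV_NONE		1
#define DEMO_SIGEV_THREAD	2
#define DEMO_SIGEV_THREAD_ID	4

/* Explicit mask covering exactly the index bits of the table */
#define DEMO_SIGEV_TYPE_MASK	0x3

/* Sized to the mask, so every masked value is a valid index */
static const char * const demo_nstr[DEMO_SIGEV_TYPE_MASK + 1] = {
	[DEMO_SIGEV_SIGNAL]	= "signal",
	[DEMO_SIGEV_NONE]	= "none",
	[DEMO_SIGEV_THREAD]	= "thread",
};

/* The thread ID flag must not overlap with the index bits */
static_assert(!(DEMO_SIGEV_THREAD_ID & DEMO_SIGEV_TYPE_MASK),
	      "SIGEV_THREAD_ID overlaps the type mask");

static const char *demo_notify_name(int notify)
{
	/* Masking bounds the index to 0..3 regardless of stray bits */
	const char *name = demo_nstr[notify & DEMO_SIGEV_TYPE_MASK];

	/* Slot 3 has no initializer and is therefore NULL */
	return name ? name : "unknown";
}
```

With this shape a corrupted notify value can at worst select the wrong string
or "unknown", but it can never index past the end of the table.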

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 09/18] posix-timers: Rework timer removal
  2025-03-08 16:48 ` [patch V3 09/18] posix-timers: Rework timer removal Thomas Gleixner
@ 2025-03-09 23:17   ` Frederic Weisbecker
  2025-03-10  6:33     ` Thomas Gleixner
  2025-03-10  8:13   ` [patch V3a " Thomas Gleixner
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-09 23:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Sat, Mar 08, 2025 at 05:48:32PM +0100, Thomas Gleixner wrote:
> @@ -988,90 +988,56 @@ static inline void posix_timer_cleanup_i
>  	}
>  }
>  
> -/* Delete a POSIX.1b interval timer. */
> -SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
> +static void posix_timer_delete(struct k_itimer *timer)
>  {
> -	struct k_itimer *timer = lock_timer(timer_id);
> -
> -retry_delete:
> -	if (!timer)
> -		return -EINVAL;
> -
> -	/* Prevent signal delivery and rearming. */
> +	/*
> +	 * Invalidate the timer, remove it from the linked list and remove
> +	 * it from the ignored list if pending.
> +	 *
> +	 * The invalidation must be written with siglock held so that the
> +	 * signal code observes timer->it_valid == false in do_sigaction(),

I guess it_valid is a leftover from previous attempts?
Aside from that and the lost WARN_ON in signal delivery:

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 09/18] posix-timers: Rework timer removal
  2025-03-09 23:17   ` Frederic Weisbecker
@ 2025-03-10  6:33     ` Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-10  6:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Mon, Mar 10 2025 at 00:17, Frederic Weisbecker wrote:
> On Sat, Mar 08, 2025 at 05:48:32PM +0100, Thomas Gleixner wrote:
>> @@ -988,90 +988,56 @@ static inline void posix_timer_cleanup_i
>>  	}
>>  }
>>  
>> -/* Delete a POSIX.1b interval timer. */
>> -SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
>> +static void posix_timer_delete(struct k_itimer *timer)
>>  {
>> -	struct k_itimer *timer = lock_timer(timer_id);
>> -
>> -retry_delete:
>> -	if (!timer)
>> -		return -EINVAL;
>> -
>> -	/* Prevent signal delivery and rearming. */
>> +	/*
>> +	 * Invalidate the timer, remove it from the linked list and remove
>> +	 * it from the ignored list if pending.
>> +	 *
>> +	 * The invalidation must be written with siglock held so that the
>> +	 * signal code observes timer->it_valid == false in do_sigaction(),
>
> I guess it_valid is a leftover from previous attempts?

Ooops, yes. Fixed now.

> Aside from that and the lost WARN_ON in signal delivery:
>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3a 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode
  2025-03-08 16:48 ` [patch V3 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode Thomas Gleixner
@ 2025-03-10  8:11   ` Thomas Gleixner
  2025-03-11 21:44     ` Frederic Weisbecker
  2025-03-13 11:31     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-10  8:11 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

The exact timer ID allocation mode is used by CRIU to restore timers with a
given ID. Add a test case for it.

It's skipped on older kernels when the prctl() fails.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3a: Fix #ifndef condition
V3: Use the PRCTL defines
V2: Adapt to the ID counter change in the exact mode case
---
 tools/testing/selftests/timers/posix_timers.c |   66 +++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -7,6 +7,7 @@
  * Kernel loop code stolen from Steven Rostedt <srostedt@redhat.com>
  */
 #define _GNU_SOURCE
+#include <sys/prctl.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <stdio.h>
@@ -599,14 +600,77 @@ static void check_overrun(int which, con
 			 "check_overrun %s\n", name);
 }
 
+#include <sys/syscall.h>
+
+static int do_timer_create(int *id)
+{
+	return syscall(__NR_timer_create, CLOCK_MONOTONIC, NULL, id);
+}
+
+static int do_timer_delete(int id)
+{
+	return syscall(__NR_timer_delete, id);
+}
+
+#ifndef PR_TIMER_CREATE_RESTORE_IDS
+# define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	 0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		 1
+#endif
+
+static void check_timer_create_exact(void)
+{
+	int id;
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0)) {
+		switch (errno) {
+		case EINVAL:
+			ksft_test_result_skip("check timer create exact, not supported\n");
+			return;
+		default:
+			ksft_test_result_skip("check timer create exact, errno = %d\n", errno);
+			return;
+		}
+	}
+
+	id = 8;
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0))
+		fatal_error(NULL, "prctl()");
+
+	if (id != 8) {
+		ksft_test_result_fail("check timer create exact %d != 8\n", id);
+		return;
+	}
+
+	/* Validate that it went back to normal mode and allocates ID 9 */
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (id == 9)
+		ksft_test_result_pass("check timer create exact\n");
+	else
+		ksft_test_result_fail("check timer create exact. Disabling failed.\n");
+}
+
 int main(int argc, char **argv)
 {
 	ksft_print_header();
-	ksft_set_plan(18);
+	ksft_set_plan(19);
 
 	ksft_print_msg("Testing posix timers. False negative may happen on CPU execution \n");
 	ksft_print_msg("based timers if other threads run on the CPU...\n");
 
+	check_timer_create_exact();
+
 	check_itimer(ITIMER_VIRTUAL, "ITIMER_VIRTUAL");
 	check_itimer(ITIMER_PROF, "ITIMER_PROF");
 	check_itimer(ITIMER_REAL, "ITIMER_REAL");

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3a 09/18] posix-timers: Rework timer removal
  2025-03-08 16:48 ` [patch V3 09/18] posix-timers: Rework timer removal Thomas Gleixner
  2025-03-09 23:17   ` Frederic Weisbecker
@ 2025-03-10  8:13   ` Thomas Gleixner
  2025-03-13 11:31     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-10  8:13 UTC (permalink / raw)
  To: LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

sys_timer_delete() and the do_exit() cleanup function itimer_delete() are
doing the same thing, but have needlessly different implementations instead
of sharing the code.

The other oddity of timer deletion is the fact that the timer is not
invalidated before the actual deletion happens, which allows concurrent
lookups to succeed.

That's wrong because a timer which is in the process of being deleted
should not be visible and any actions like signal queueing, delivery and
rearming should not happen once the task, which invoked timer_delete(), has
the timer locked.

Rework the code so that:

   1) The signal queueing and delivery code ignore timers which are marked
      invalid

   2) The deletion implementation between sys_timer_delete() and
      itimer_delete() is shared

   3) The timer is invalidated and removed from the linked lists before
      the deletion callback of the relevant clock is invoked.

      That requires reworking timer_wait_running() as it does a lookup of
      the timer when relocking it at the end. In case of deletion this
      lookup would fail due to the preceding invalidation and the wait loop
      would terminate prematurely.

      But due to the preceding invalidation the timer cannot be accessed by
      other tasks anymore, so there is no way that the timer has been freed
      after the timer lock has been dropped.

      Move the re-validation out of timer_wait_running() and handle it at
      the only other usage site, timer_settime().
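
The invalidation scheme described in the steps above, marking a timer invalid
by setting bit 0 of its signal pointer instead of clearing the pointer, can be
sketched in userspace C as follows. This is only an illustration: struct
demo_signal, struct demo_timer and the demo_*() helpers are hypothetical
stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_signal { int dummy; };

struct demo_timer {
	struct demo_signal *it_signal;	/* bit 0 set means "invalid" */
};

/* Bit 0 is only free if the pointed-to type is at least 2-byte aligned */
static_assert(_Alignof(struct demo_signal) >= 2, "bit 0 must be unused");

static bool demo_timer_valid(const struct demo_timer *t)
{
	return !((uintptr_t)t->it_signal & 1UL);
}

static void demo_timer_invalidate(struct demo_timer *t)
{
	t->it_signal = (struct demo_signal *)((uintptr_t)t->it_signal | 1UL);
}

/* Owner with the marker bit masked out, usable on valid and invalid timers */
static struct demo_signal *demo_timer_owner(const struct demo_timer *t)
{
	return (struct demo_signal *)((uintptr_t)t->it_signal & ~(uintptr_t)1);
}

/* A plain pointer comparison, as a lookup would do, fails once bit 0 is set */
static bool demo_lookup_matches(const struct demo_timer *t,
				const struct demo_signal *sig)
{
	return t->it_signal == sig;
}
```

The point of keeping the pointer value instead of writing NULL is that the
owner stays identifiable, so the ID remains reserved for that owner, while
any lookup comparing against the untagged pointer no longer matches.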

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
V3a: Bring back warning and fixup comment - Frederic
V2: Simplify timer_wait_running() locking - PeterZ
---
 include/linux/posix-timers.h |    7 +
 kernel/signal.c              |    2 
 kernel/time/posix-timers.c   |  194 ++++++++++++++++++-------------------------
 3 files changed, 90 insertions(+), 113 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -240,6 +240,13 @@ static inline void posixtimer_sigqueue_p
 
 	posixtimer_putref(tmr);
 }
+
+static inline bool posixtimer_valid(const struct k_itimer *timer)
+{
+	unsigned long val = (unsigned long)timer->it_signal;
+
+	return !(val & 0x1UL);
+}
 #else  /* CONFIG_POSIX_TIMERS */
 static inline void posixtimer_sigqueue_getref(struct sigqueue *q) { }
 static inline void posixtimer_sigqueue_putref(struct sigqueue *q) { }
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2092,7 +2092,7 @@ static inline void posixtimer_sig_ignore
 	 * from a non-periodic timer, then just drop the reference
 	 * count. Otherwise queue it on the ignored list.
 	 */
-	if (tmr->it_signal && tmr->it_sig_periodic)
+	if (posixtimer_valid(tmr) && tmr->it_sig_periodic)
 		hlist_add_head(&tmr->ignored_list, &tsk->signal->ignored_posix_timers);
 	else
 		posixtimer_putref(tmr);
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -279,7 +279,7 @@ static bool __posixtimer_deliver_signal(
 	 * since the signal was queued. In either case, don't rearm and
 	 * drop the signal.
 	 */
-	if (timr->it_signal_seq != timr->it_sigqueue_seq || WARN_ON_ONCE(!timr->it_signal))
+	if (timr->it_signal_seq != timr->it_sigqueue_seq || WARN_ON_ONCE(!posixtimer_valid(timr)))
 		return false;
 
 	if (!timr->it_interval || WARN_ON_ONCE(timr->it_status != POSIX_TIMER_REQUEUE_PENDING))
@@ -324,6 +324,9 @@ void posix_timer_queue_signal(struct k_i
 {
 	lockdep_assert_held(&timr->it_lock);
 
+	if (!posixtimer_valid(timr))
+		return;
+
 	timr->it_status = timr->it_interval ? POSIX_TIMER_REQUEUE_PENDING : POSIX_TIMER_DISARMED;
 	posixtimer_send_sigqueue(timr);
 }
@@ -553,11 +556,11 @@ static struct k_itimer *__lock_timer(tim
 	 * The hash lookup and the timers are RCU protected.
 	 *
 	 * Timers are added to the hash in invalid state where
-	 * timr::it_signal == NULL. timer::it_signal is only set after the
-	 * rest of the initialization succeeded.
+	 * timr::it_signal is marked invalid. timer::it_signal is only set
+	 * after the rest of the initialization succeeded.
 	 *
 	 * Timer destruction happens in steps:
-	 *  1) Set timr::it_signal to NULL with timr::it_lock held
+	 *  1) Set timr::it_signal marked invalid with timr::it_lock held
 	 *  2) Release timr::it_lock
 	 *  3) Remove from the hash under hash_lock
 	 *  4) Put the reference count.
@@ -574,8 +577,8 @@ static struct k_itimer *__lock_timer(tim
 	 *
 	 * The lookup validates locklessly that timr::it_signal ==
 	 * current::it_signal and timr::it_id == @timer_id. timr::it_id
-	 * can't change, but timr::it_signal becomes NULL during
-	 * destruction.
+	 * can't change, but timr::it_signal can become invalid during
+	 * destruction, which makes the locked check fail.
 	 */
 	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
@@ -811,22 +814,13 @@ static void common_timer_wait_running(st
  * when the task which tries to delete or disarm the timer has preempted
  * the task which runs the expiry in task work context.
  */
-static struct k_itimer *timer_wait_running(struct k_itimer *timer)
+static void timer_wait_running(struct k_itimer *timer)
 {
-	timer_t timer_id = READ_ONCE(timer->it_id);
-
-	/* Prevent kfree(timer) after dropping the lock */
-	scoped_guard (rcu) {
-		unlock_timer(timer);
-		/*
-		 * kc->timer_wait_running() might drop RCU lock. So @timer
-		 * cannot be touched anymore after the function returns!
-		 */
-		timer->kclock->timer_wait_running(timer);
-	}
-
-	/* Relock the timer. It might be not longer hashed. */
-	return lock_timer(timer_id);
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
+	timer->kclock->timer_wait_running(timer);
 }
 
 /*
@@ -885,8 +879,7 @@ static int do_timer_settime(timer_t time
 			    struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	struct k_itimer *timr;
-	int error;
+	int ret;
 
 	if (!timespec64_valid(&new_spec64->it_interval) ||
 	    !timespec64_valid(&new_spec64->it_value))
@@ -895,29 +888,36 @@ static int do_timer_settime(timer_t time
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	timr = lock_timer(timer_id);
-retry:
-	if (!timr)
-		return -EINVAL;
+	for (;;) {
+		struct k_itimer *timr = lock_timer(timer_id);
 
-	if (old_spec64)
-		old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
+		if (!timr)
+			return -EINVAL;
 
-	/* Prevent signal delivery and rearming. */
-	timr->it_signal_seq++;
+		if (old_spec64)
+			old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
 
-	error = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		/* Prevent signal delivery and rearming. */
+		timr->it_signal_seq++;
+
+		ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		if (ret != TIMER_RETRY) {
+			unlock_timer(timr);
+			break;
+		}
 
-	if (error == TIMER_RETRY) {
-		// We already got the old time...
+		/* Read the old time only once */
 		old_spec64 = NULL;
-		/* Unlocks and relocks the timer if it still exists */
-		timr = timer_wait_running(timr);
-		goto retry;
+		/* Protect the timer from being freed after the lock is dropped */
+		guard(rcu)();
+		unlock_timer(timr);
+		/*
+		 * timer_wait_running() might drop RCU read side protection
+		 * so the timer has to be looked up again!
+		 */
+		timer_wait_running(timr);
 	}
-	unlock_timer(timr);
-
-	return error;
+	return ret;
 }
 
 /* Set a POSIX.1b interval timer */
@@ -988,90 +988,56 @@ static inline void posix_timer_cleanup_i
 	}
 }
 
-/* Delete a POSIX.1b interval timer. */
-SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
+static void posix_timer_delete(struct k_itimer *timer)
 {
-	struct k_itimer *timer = lock_timer(timer_id);
-
-retry_delete:
-	if (!timer)
-		return -EINVAL;
-
-	/* Prevent signal delivery and rearming. */
+	/*
+	 * Invalidate the timer, remove it from the linked list and remove
+	 * it from the ignored list if pending.
+	 *
+	 * The invalidation must be written with siglock held so that the
+	 * signal code observes the invalidated timer::it_signal in
+	 * do_sigaction(), which prevents it from moving a pending signal
+	 * of a deleted timer to the ignore list.
+	 *
+	 * The invalidation also prevents signal queueing, signal delivery
+	 * and therefore rearming from the signal delivery path.
+	 *
+	 * A concurrent lookup can still find the timer in the hash, but it
+	 * will check timer::it_signal with timer::it_lock held and observe
+	 * bit 0 set, which invalidates it. That also prevents the timer ID
+	 * from being handed out before this timer is completely gone.
+	 */
 	timer->it_signal_seq++;
 
-	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
-		/* Unlocks and relocks the timer if it still exists */
-		timer = timer_wait_running(timer);
-		goto retry_delete;
-	}
-
 	scoped_guard (spinlock, &current->sighand->siglock) {
+		unsigned long sig = (unsigned long)timer->it_signal | 1UL;
+
+		WRITE_ONCE(timer->it_signal, (struct signal_struct *)sig);
 		hlist_del(&timer->list);
 		posix_timer_cleanup_ignored(timer);
-		/*
-		 * A concurrent lookup could check timer::it_signal lockless. It
-		 * will reevaluate with timer::it_lock held and observe the NULL.
-		 *
-		 * It must be written with siglock held so that the signal code
-		 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
-		 * which prevents it from moving a pending signal of a deleted
-		 * timer to the ignore list.
-		 */
-		WRITE_ONCE(timer->it_signal, NULL);
 	}
 
-	unlock_timer(timer);
-	posix_timer_unhash_and_free(timer);
-	return 0;
+	while (timer->kclock->timer_del(timer) == TIMER_RETRY) {
+		guard(rcu)();
+		spin_unlock_irq(&timer->it_lock);
+		timer_wait_running(timer);
+		spin_lock_irq(&timer->it_lock);
+	}
 }
 
-/*
- * Delete a timer if it is armed, remove it from the hash and schedule it
- * for RCU freeing.
- */
-static void itimer_delete(struct k_itimer *timer)
+/* Delete a POSIX.1b interval timer. */
+SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	spin_lock_irq(&timer->it_lock);
-
-retry_delete:
-	/*
-	 * Even if the timer is not longer accessible from other tasks
-	 * it still might be armed and queued in the underlying timer
-	 * mechanism. Worse, that timer mechanism might run the expiry
-	 * function concurrently.
-	 */
-	if (timer->kclock->timer_del(timer) == TIMER_RETRY) {
-		/*
-		 * Timer is expired concurrently, prevent livelocks
-		 * and pointless spinning on RT.
-		 *
-		 * timer_wait_running() drops timer::it_lock, which opens
-		 * the possibility for another task to delete the timer.
-		 *
-		 * That's not possible here because this is invoked from
-		 * do_exit() only for the last thread of the thread group.
-		 * So no other task can access and delete that timer.
-		 */
-		if (WARN_ON_ONCE(timer_wait_running(timer) != timer))
-			return;
-
-		goto retry_delete;
-	}
-	hlist_del(&timer->list);
-
-	posix_timer_cleanup_ignored(timer);
+	struct k_itimer *timer = lock_timer(timer_id);
 
-	/*
-	 * Setting timer::it_signal to NULL is technically not required
-	 * here as nothing can access the timer anymore legitimately via
-	 * the hash table. Set it to NULL nevertheless so that all deletion
-	 * paths are consistent.
-	 */
-	WRITE_ONCE(timer->it_signal, NULL);
+	if (!timer)
+		return -EINVAL;
 
-	spin_unlock_irq(&timer->it_lock);
+	posix_timer_delete(timer);
+	unlock_timer(timer);
+	/* Remove it from the hash, which frees up the timer ID */
 	posix_timer_unhash_and_free(timer);
+	return 0;
 }
 
 /*
@@ -1082,6 +1048,8 @@ static void itimer_delete(struct k_itime
 void exit_itimers(struct task_struct *tsk)
 {
 	struct hlist_head timers;
+	struct hlist_node *next;
+	struct k_itimer *timer;
 
 	if (hlist_empty(&tsk->signal->posix_timers))
 		return;
@@ -1091,8 +1059,10 @@ void exit_itimers(struct task_struct *ts
 		hlist_move_list(&tsk->signal->posix_timers, &timers);
 
 	/* The timers are not longer accessible via tsk::signal */
-	while (!hlist_empty(&timers)) {
-		itimer_delete(hlist_entry(timers.first, struct k_itimer, list));
+	hlist_for_each_entry_safe(timer, next, &timers, list) {
+		scoped_guard (spinlock_irq, &timer->it_lock)
+			posix_timer_delete(timer);
+		posix_timer_unhash_and_free(timer);
 		cond_resched();
 	}
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 10/18] posix-timers: Make lock_timer() use guard()
  2025-03-08 16:48 ` [patch V3 10/18] posix-timers: Make lock_timer() use guard() Thomas Gleixner
@ 2025-03-10 11:57   ` Frederic Weisbecker
  2025-03-10 17:36     ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-10 11:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:34PM +0100, Thomas Gleixner a écrit :
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -63,9 +63,18 @@ static struct k_itimer *__lock_timer(tim
>  
>  static inline void unlock_timer(struct k_itimer *timr)
>  {
> -	spin_unlock_irq(&timr->it_lock);
> +	if (likely((timr)))
> +		spin_unlock_irq(&timr->it_lock);
>  }
>  
> +#define scoped_timer_get_or_fail(_id)					\
> +	scoped_cond_guard(lock_timer, return -EINVAL, _id)

I'm not really fond of the fact this hides a return.

But anyway:

Acked-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 10/18] posix-timers: Make lock_timer() use guard()
  2025-03-10 11:57   ` Frederic Weisbecker
@ 2025-03-10 17:36     ` Thomas Gleixner
  2025-03-10 22:16       ` Frederic Weisbecker
  0 siblings, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-10 17:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Mon, Mar 10 2025 at 12:57, Frederic Weisbecker wrote:
> Le Sat, Mar 08, 2025 at 05:48:34PM +0100, Thomas Gleixner a écrit :
>> --- a/kernel/time/posix-timers.c
>> +++ b/kernel/time/posix-timers.c
>> @@ -63,9 +63,18 @@ static struct k_itimer *__lock_timer(tim
>>  
>>  static inline void unlock_timer(struct k_itimer *timr)
>>  {
>> -	spin_unlock_irq(&timr->it_lock);
>> +	if (likely((timr)))
>> +		spin_unlock_irq(&timr->it_lock);
>>  }
>>  
>> +#define scoped_timer_get_or_fail(_id)					\
>> +	scoped_cond_guard(lock_timer, return -EINVAL, _id)
>
> I'm not really fond of the fact this hides a return.

I could drop the macro and let the call sites all do:

	scoped_cond_guard(lock_timer, return -EINVAL, $d)

But I'm not sure it's much better :)
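
For readers who have not met the pattern, the concern can be reproduced in
a small userspace sketch. Everything below is a simplified stand-in and
not the kernel implementation: the real scoped_cond_guard() is built
differently, the two-entry timer table and sketch_getoverrun() are invented
for illustration, and the code relies on the GCC/Clang cleanup attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct k_itimer {
	int locked;
	int overrun;
};

/* Hypothetical timer table with two valid IDs: 0 and 1 */
static struct k_itimer timers[2] = { { 0, 3 }, { 0, 7 } };

/* Stand-in for lock_timer(): returns NULL for an unknown ID */
static struct k_itimer *lock_timer(int id)
{
	if (id < 0 || id >= 2)
		return NULL;
	assert(!timers[id].locked);
	timers[id].locked = 1;
	return &timers[id];
}

/* Cleanup handler, tolerates NULL like the patched unlock_timer() */
static void unlock_timer(struct k_itimer **timr)
{
	if (*timr)
		(*timr)->locked = 0;
}

/* Simplified imitation: the return is hidden in the if branch */
#define scoped_timer_get_or_fail(_t, _id)					\
	for (struct k_itimer *_t __attribute__((cleanup(unlock_timer))) =	\
		     lock_timer(_id), *_t##_done = NULL;			\
	     !_t##_done; _t##_done = _t)					\
		if (!_t)							\
			return -EINVAL;						\
		else

static int sketch_getoverrun(int id)
{
	scoped_timer_get_or_fail(t, id)
		return t->overrun;	/* The timer is unlocked on scope exit */

	return -EINVAL;			/* Not reached */
}
```

Nothing at the call site in sketch_getoverrun() shows that an unknown ID
makes the function return -EINVAL, which is the readability cost being
discussed. The benefit is that no exit path can forget the unlock.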

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 10/18] posix-timers: Make lock_timer() use guard()
  2025-03-10 17:36     ` Thomas Gleixner
@ 2025-03-10 22:16       ` Frederic Weisbecker
  0 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-10 22:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Mon, Mar 10, 2025 at 06:36:18PM +0100, Thomas Gleixner a écrit :
> On Mon, Mar 10 2025 at 12:57, Frederic Weisbecker wrote:
> > Le Sat, Mar 08, 2025 at 05:48:34PM +0100, Thomas Gleixner a écrit :
> >> --- a/kernel/time/posix-timers.c
> >> +++ b/kernel/time/posix-timers.c
> >> @@ -63,9 +63,18 @@ static struct k_itimer *__lock_timer(tim
> >>  
> >>  static inline void unlock_timer(struct k_itimer *timr)
> >>  {
> >> -	spin_unlock_irq(&timr->it_lock);
> >> +	if (likely((timr)))
> >> +		spin_unlock_irq(&timr->it_lock);
> >>  }
> >>  
> >> +#define scoped_timer_get_or_fail(_id)					\
> >> +	scoped_cond_guard(lock_timer, return -EINVAL, _id)
> >
> > I'm not really fond of the fact this hides a return.
> 
> I could drop the macro and let the call sites all do:
> 
> 	scoped_cond_guard(lock_timer, return -EINVAL, $d)
> 
> But I'm not sure it's much better :)

Nah let's just keep it as is, until we ever find a better idea :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 11/18] posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
  2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t Thomas Gleixner
@ 2025-03-10 22:57   ` Frederic Weisbecker
  2025-03-11 13:41   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
  2 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-10 22:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:36PM +0100, Thomas Gleixner a écrit :
> From: Eric Dumazet <edumazet@google.com>
> 
> The global hash_lock protecting the posix timer hash table can be heavily
> contended especially when there is an extensive linear search for a timer
> ID.
> 
> Timer IDs are handed out by monotonically increasing next_posix_timer_id
> and then validating that there is no timer with the same ID in the hash
> table. Both operations happen with the global hash lock held.
> 
> To reduce the hash lock contention the hash will be reworked to a scaled
> hash with per bucket locks, which requires handling the ID counter
> locklessly.
> 
> Prepare for this by making next_posix_timer_id an atomic_t, which can be
> used lockless with atomic_inc_return().
> 
> [ tglx: Adopted from Eric's series, massaged change log and simplified it ]
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table
  2025-03-08 16:48 ` [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table Thomas Gleixner
@ 2025-03-11 13:25   ` Frederic Weisbecker
  2025-03-11 14:16     ` Thomas Gleixner
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 13:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:14PM +0100, Thomas Gleixner a écrit :
>  kernel/time/posix-timers.c |   56 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 42 insertions(+), 14 deletions(-)
> 
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -72,13 +72,13 @@ static int hash(struct signal_struct *si
>  	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
>  }
>  
> -static struct k_itimer *__posix_timers_find(struct hlist_head *head,
> -					    struct signal_struct *sig,
> -					    timer_t id)
> +static struct k_itimer *posix_timer_by_id(timer_t id)
>  {
> +	struct signal_struct *sig = current->signal;
> +	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
>  	struct k_itimer *timer;
>  
> -	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
> +	hlist_for_each_entry_rcu(timer, head, t_hash) {
>  		/* timer->it_signal can be set concurrently */
>  		if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
>  			return timer;
> @@ -86,12 +86,26 @@ static struct k_itimer *__posix_timers_f
>  	return NULL;
>  }
>  
> -static struct k_itimer *posix_timer_by_id(timer_t id)
> +static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
>  {
> -	struct signal_struct *sig = current->signal;
> -	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
> +	unsigned long val = (unsigned long)timer->it_signal;

When used from posix_timer_add() -> posix_timer_hashed(), it can race
with another do_timer_create() that clears the BIT 0. It's fine but
KCSAN is going to warn sooner or later.

It looks like a good candidate for data_race() ? Well, READ_ONCE() is
fine too.

Thanks.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 11/18] posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
  2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t Thomas Gleixner
  2025-03-10 22:57   ` Frederic Weisbecker
@ 2025-03-11 13:41   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
  2 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 13:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:36PM +0100, Thomas Gleixner a écrit :
> From: Eric Dumazet <edumazet@google.com>
> 
> The global hash_lock protecting the posix timer hash table can be heavily
> contended especially when there is an extensive linear search for a timer
> ID.
> 
> Timer IDs are handed out by monotonically increasing next_posix_timer_id
> and then validating that there is no timer with the same ID in the hash
> table. Both operations happen with the global hash lock held.
> 
> To reduce the hash lock contention the hash will be reworked to a scaled
> hash with per bucket locks, which requires handling the ID counter
> locklessly.
> 
> Prepare for this by making next_posix_timer_id an atomic_t, which can be
> used lockless with atomic_inc_return().
> 
> [ tglx: Adopted from Eric's series, massaged change log and simplified it ]
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com

Acked-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 12/18] posix-timers: Improve hash table performance
  2025-03-08 16:48 ` [patch V3 12/18] posix-timers: Improve hash table performance Thomas Gleixner
@ 2025-03-11 13:44   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 13:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:38PM +0100, Thomas Gleixner a écrit :
> Eric and Ben reported a significant performance bottleneck on the global
> hash, which is used to store posix timers for lookup.
> 
> Eric tried to do a lockless validation of a new timer ID before trying to
> insert the timer, but that does not solve the problem.
> 
> For the non-contended case this is a pointless exercise and for the
> contended case this extra lookup just creates enough interleaving that all
> tasks can make progress.
> 
> There are actually two real solutions to the problem:
> 
>   1) Provide a per process (signal struct) xarray storage
> 
>   2) Implement a smarter hash like the one in the futex code
> 
> #1 works perfectly fine for most cases, but the fact that CRIU enforced a
>    linearly increasing timer ID to restore timers makes this problematic.
> 
>    It's easy enough to create a sparse timer ID space, which very quickly
>    amounts to a large chunk of memory consumed for the xarray. 2048 timers
>    with an ID offset of 512 consume more than one megabyte of memory for
>    the xarray storage.
> 
> #2 The main advantage of the futex hash is that it uses per hash bucket
>    locks instead of a global hash lock. Aside of that it is scaled
>    according to the number of CPUs at boot time.
> 
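The memory figure for the sparse case can be checked with a rough node
count. This is a back-of-the-envelope sketch, not kernel code. It assumes
64 slots per xarray node and 576 bytes per node on a 64-bit kernel; neither
value is stated in this mail, and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions: 64 slots per xarray node, 576 bytes per node (64-bit) */
#define XA_CHUNK_SHIFT	6
#define XA_NODE_BYTES	576UL

/*
 * Count the nodes needed to store 'nr' entries at indices 0, stride,
 * 2 * stride, ... The indices ascend, so the distinct nodes per tree
 * level can be counted by watching for changes of the node index.
 */
static unsigned long xa_nodes_needed(unsigned long nr, unsigned long stride)
{
	unsigned long total = 0, distinct;
	unsigned int shift = XA_CHUNK_SHIFT;

	assert(stride > 0);

	do {
		unsigned long i, prev = ~0UL;

		distinct = 0;
		for (i = 0; i < nr; i++) {
			unsigned long node = (i * stride) >> shift;

			if (node != prev) {
				distinct++;
				prev = node;
			}
		}
		total += distinct;
		shift += XA_CHUNK_SHIFT;
	} while (distinct > 1 && shift < 8 * sizeof(unsigned long));

	return total;
}

static unsigned long xa_bytes_needed(unsigned long nr, unsigned long stride)
{
	return xa_nodes_needed(nr, stride) * XA_NODE_BYTES;
}
```

Under these assumptions xa_bytes_needed(2048, 512) comes to 2309 nodes or
1329984 bytes, which matches the "more than one megabyte" above. The same
2048 timers with consecutive IDs need 33 nodes, i.e. 19008 bytes.
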
> Experiments with artificial benchmarks have shown that a scaled hash with
> per bucket locks comes pretty close to the xarray performance and in some
> scenarios it performs better.
> 
> Test 1:
> 
>      A single process creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create         23 ms       23 ms      9 ms     8 ms
> getoverrun     14 ms       14 ms      5 ms     4 ms
> 
> Test 2:
> 
>      A single process creates 50000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create         98 ms      219 ms     20 ms    18 ms
> getoverrun     62 ms       62 ms     10 ms     9 ms
> 
> Test 3:
> 
>      A single process creates 100000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create        313 ms      750 ms     48 ms    33 ms
> getoverrun    261 ms      260 ms     20 ms    14 ms
> 
> Eric's changes create quite some overhead in the create() path due to the
> double list walk, as the main issue according to perf is the list walk
> itself. With 100k timers each hash bucket contains ~200 timers, which in
> the worst case need to be all inspected. The same problem applies for
> getoverrun() where the lookup has to walk through the hash buckets to find
> the timer it is looking for.
> 
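The ~200 figure follows from the table size. A minimal sketch of the
arithmetic, assuming the global table has 512 buckets (9 hash bits) and a
uniform distribution; the bucket count is an assumption and not stated in
this mail:

```c
#include <assert.h>

/* Assumption: 9 hash bits, i.e. 512 buckets in the global table */
#define POSIX_TIMER_HASH_BITS	9

/* Average number of timers per bucket for a uniformly distributed hash */
static unsigned long avg_chain_len(unsigned long nr_timers, unsigned int bits)
{
	assert(bits < 8 * sizeof(unsigned long));
	return nr_timers / (1UL << bits);
}
```

avg_chain_len(100000, POSIX_TIMER_HASH_BITS) yields 195. Validating that a
new ID is unused has to inspect the whole chain, and doing a lockless
validation followed by a locked one walks it twice.
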
> The scaled hash obviously reduces hash collisions and lock contention
> significantly. This becomes more prominent with concurrency.
> 
> Test 4:
> 
>      A process creates 63 threads and all threads wait on a barrier before
>      each instance creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them. The threads are pinned on
>      separate CPUs to achieve maximum concurrency. The numbers are the
>      average times per thread:
> 
>             mainline        Eric   newhash   xarray
> create     180239 ms    38599 ms    579 ms   813 ms
> getoverrun   2645 ms     2642 ms     32 ms     7 ms
> 
> Test 5:
> 
>      A process forks 63 times and all forks wait on a barrier before each
>      instance creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them. The processes are pinned on
>      separate CPUs to achieve maximum concurrency. The numbers are the
>      average times per process:
> 
>             mainline        Eric   newhash   xarray
> create     157253 ms    40008 ms     83 ms    60 ms
> getoverrun   2611 ms     2614 ms     40 ms     4 ms
> 
> So clearly the reduction of lock contention with Eric's changes makes a
> significant difference for the create() loop, but it does not mitigate the
> problem of long list walks, which is clearly visible on the getoverrun()
> side because that is purely dominated by the lookup itself. Once the timer
> is found, the syscall just reads from the timer structure with no other
> locks or code paths involved and returns.
> 
> The reason for the difference between the thread and the fork case for the
> new hash and the xarray is that both suffer from contention on
> sighand::siglock and the xarray suffers additionally from contention on the
> xarray lock on insertion.
> 
> The only case where the reworked hash slightly outperforms the xarray is a
> tight loop which creates and deletes timers.
> 
> Test 6:
> 
>      A process creates 63 threads and all threads wait on a barrier before
>      each instance runs a loop which creates and deletes a timer 100000
>      times in a row. The threads are pinned on separate CPUs to achieve
>      maximum concurrency. The numbers are the average times per thread:
> 
>             mainline        Eric   newhash   xarray
> loop         5917 ms     5897 ms   5473 ms  7846 ms
> 
> Test 7:
> 
>      A process forks 63 times and all forks wait on a barrier before each
>      instance runs a loop which creates and deletes a timer 100000
>      times in a row. The processes are pinned on separate CPUs to achieve
>      maximum concurrency. The numbers are the average times per process:
> 
>             mainline        Eric   newhash   xarray
> loop         5137 ms     7828 ms    891 ms   872 ms
> 
> In both tests there is not much contention on the hash, but the ucount
> accounting for the signal and in the thread case the sighand::siglock
> contention (plus the xarray locking) contribute dominantly to the overhead.
> 
> As the memory consumption of the xarray in the sparse ID case is
> significant, the scaled hash with per bucket locks seems to be the better
> overall option. While the xarray has faster lookup times for a large number
> of timers, the actual syscall usage, which requires the lookup, is not an
> extreme hotpath. Most applications utilize signal delivery and all syscalls
> except timer_getoverrun(2) are anything but cheap.
> 
> So implement a scaled hash with per bucket locks, which offers the best
> tradeoff between performance and memory consumption.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Benjamin Segall <bsegall@google.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 14/18] posix-timers: Avoid false cacheline sharing
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
@ 2025-03-11 13:53   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 13:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:42PM +0100, Thomas Gleixner a écrit :
> struct k_itimer has the hlist_node, which is used for lookup in the hash
> bucket, and the timer lock in the same cache line.
> 
> That's obviously bad, if one CPU fiddles with a timer and the other is
> walking the hash bucket on which that timer is queued.
> 
> Avoid this by restructuring struct k_itimer, so that the read mostly (only
> modified during setup and teardown) fields are in the first cache line and
> the lock and the rest of the fields which get written to are in cacheline
> 2-N.
> 
> Reduces cacheline contention in a test case of 64 processes creating and
> accessing 20000 timers each by almost 30% according to perf.
> 
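The layout idea can be shown with a heavily reduced userspace structure.
This is a sketch under assumptions: 64-byte cache lines, an int standing in
for the spinlock, and a made-up subset of fields. It is not the real
struct k_itimer.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE	64	/* Assumption: 64-byte cache lines */

struct hlist_node {
	struct hlist_node *next, **pprev;
};

/* Hypothetical, heavily reduced layout */
struct k_itimer_sketch {
	/* 1st cache line: written only during setup and teardown */
	struct hlist_node	t_hash;
	int			it_id;
	void			*it_signal;

	/* 2nd cache line onwards: written while the timer is in use */
	_Alignas(CACHELINE) int	it_lock;	/* Stand-in for spinlock_t */
	long long		it_overrun;
};

/* A bucket walk reads t_hash, a timer operation writes it_lock */
static_assert(offsetof(struct k_itimer_sketch, t_hash) / CACHELINE !=
	      offsetof(struct k_itimer_sketch, it_lock) / CACHELINE,
	      "hash node and lock must not share a cache line");
```

With this split a CPU walking the bucket only touches the first line, so a
lock acquisition on another CPU does not invalidate the line it reads.
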
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Impressive what a field reshuffle and alignment can achieve!

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table
  2025-03-11 13:25   ` Frederic Weisbecker
@ 2025-03-11 14:16     ` Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-11 14:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Tue, Mar 11 2025 at 14:25, Frederic Weisbecker wrote:
> Le Sat, Mar 08, 2025 at 05:48:14PM +0100, Thomas Gleixner a écrit :
>>  
>> -static struct k_itimer *posix_timer_by_id(timer_t id)
>> +static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
>>  {
>> -	struct signal_struct *sig = current->signal;
>> -	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
>> +	unsigned long val = (unsigned long)timer->it_signal;
>
> When used from posix_timer_add() -> posix_timer_hashed(), it can race
> with another do_timer_create() that clears the BIT 0. It's fine but
> KCSAN is going to warn sooner or later.

Indeed

> It looks like a good candidate for data_race() ? Well, READ_ONCE() is
> fine too.

READ_ONCE() is the right thing to do.
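
A userspace sketch of what the annotated read could look like. The types
are stand-ins, the READ_ONCE() below is a simplified volatile-load
imitation of the kernel macro, and the reading of bit 0 as a "not yet
valid" marker is inferred from the discussion above rather than copied
from the final patch.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's READ_ONCE(): one volatile load */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

struct signal_struct {
	int dummy;
};

struct k_itimer {
	/* Bit 0 set: hashed, but setup is not yet complete (assumption) */
	struct signal_struct *it_signal;
};

/* Bit 0 can only carry a marker if the pointee is at least 2-byte aligned */
static_assert(_Alignof(struct signal_struct) > 1, "bit 0 must be free");

static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
{
	/* Marked access: a concurrent creator may clear bit 0 at any time */
	unsigned long val = (unsigned long)READ_ONCE(timer->it_signal);

	/* Mask out bit 0 to get the owner regardless of the setup state */
	return (struct signal_struct *)(val & ~1UL);
}
```

Either value of bit 0 produces the same owner, so the race is harmless.
The marked access documents that and keeps KCSAN quiet.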

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
  2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held Thomas Gleixner
  2025-03-08 22:38   ` Cyrill Gorcunov
@ 2025-03-11 15:26   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 15:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:45PM +0100, Thomas Gleixner a écrit :
> The readout of /proc/$PID/timers holds sighand::siglock with interrupts
> disabled. That is required to protect against concurrent modifications of
> the task::signal::posix_timers list because the list is not RCU safe.
> 
> With the conversion of the timer storage to an RCU protected hlist, this is
> no longer required.
> 
> The only requirement is to protect the returned entry against a concurrent
> free, which is trivial as the timers are RCU protected.
> 
> Removing the trylock of sighand::siglock is benign because the life time of
> task_struct::signal is bound to the life time of the task_struct itself.
> 
> There are two scenarios where this matters:
> 
>   1) The process is alive and not about to be checkpointed
> 
>   2) The process is stopped via ptrace for checkpointing
> 
> #1 is a racy snapshot of the armed timers and nothing can rely on it. It's
>    not more than debug information and it has been that way before because
>    sighand lock is dropped when the buffer is full and the restart of
>    the iteration might find a completely different set of timers.
> 
>    The task and therefore task::signal cannot be freed as timers_start()
>    acquired a reference count via get_pid_task().
> 
> #2 the process is stopped for checkpointing so nothing can delete or create
>    timers at this point. Neither can the process exit during the traversal.
> 
>    If CRIU fails to observe an exit in progress prior to the dissemination
>    of the timers, then there are more severe problems to solve in the CRIU
>    mechanics as they can't rely on posix timers being enabled in the first
>    place.
> 
> Therefore replace the lock acquisition with rcu_read_lock() and switch the
> timer storage traversal over to seq_hlist_*_rcu().
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 15/18] posix-timers: Make per process list RCU safe
  2025-03-08 16:48 ` [patch V3 15/18] posix-timers: Make per process list RCU safe Thomas Gleixner
@ 2025-03-11 15:29   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 15:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:43PM +0100, Thomas Gleixner a écrit :
> Preparatory change to remove the sighand locking from the /proc/$PID/timers
> iterator.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-08 16:48 ` [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID Thomas Gleixner
  2025-03-08 22:25   ` Cyrill Gorcunov
@ 2025-03-11 21:35   ` Frederic Weisbecker
  2025-03-11 22:05     ` Thomas Gleixner
  2025-03-12 12:59     ` [patch V3 17/18] " Cyrill Gorcunov
  1 sibling, 2 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 21:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Sat, Mar 08, 2025 at 05:48:47PM +0100, Thomas Gleixner a écrit :
> @@ -364,6 +389,16 @@ static enum hrtimer_restart posix_timer_
>  	return HRTIMER_NORESTART;
>  }
>  
> +long posixtimer_create_prctl(unsigned long ctrl)
> +{
> +	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
> +		return -EINVAL;
> +
> +	guard(spinlock_irq)(&current->sighand->siglock);
> +	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;

Is the locking necessary here? It's not used on the read side.
It only makes sense if more flags are to be added later in struct signal and the
fields write can race.

Also do we want to carry this PR_TIMER_CREATE_RESTORE_IDS_ON across exec? Posix
timers are removed then anyway.

Thanks.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode
  2025-03-10  8:11   ` [patch V3a " Thomas Gleixner
@ 2025-03-11 21:44     ` Frederic Weisbecker
  2025-03-13 11:31     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 21:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Mon, Mar 10, 2025 at 09:11:42AM +0100, Thomas Gleixner a écrit :
> The exact timer ID allocation mode is used by CRIU to restore timers with a
> given ID. Add a test case for it.
> 
> It's skipped on older kernels when the prctl() fails.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 21:35   ` Frederic Weisbecker
@ 2025-03-11 22:05     ` Thomas Gleixner
  2025-03-11 22:07       ` [patch V3a " Thomas Gleixner
  2025-03-12 12:59     ` [patch V3 17/18] " Cyrill Gorcunov
  1 sibling, 1 reply; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-11 22:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

On Tue, Mar 11 2025 at 22:35, Frederic Weisbecker wrote:
> Le Sat, Mar 08, 2025 at 05:48:47PM +0100, Thomas Gleixner a écrit :
>> @@ -364,6 +389,16 @@ static enum hrtimer_restart posix_timer_
>>  	return HRTIMER_NORESTART;
>>  }
>>  
>> +long posixtimer_create_prctl(unsigned long ctrl)
>> +{
>> +	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
>> +		return -EINVAL;
>> +
>> +	guard(spinlock_irq)(&current->sighand->siglock);
>> +	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;
>
> Is the locking necessary here? It's not used on the read side.
> It only makes sense if more flags are to be added later in struct signal and the
> fields write can race.

True.

> Also do we want to carry this PR_TIMER_CREATE_RESTORE_IDS_ON across exec? Posix
> timers are removed then anyway.

Indeed, we should clear that.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 22:05     ` Thomas Gleixner
@ 2025-03-11 22:07       ` Thomas Gleixner
  2025-03-11 22:32         ` Frederic Weisbecker
  2025-03-13 11:31         ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-11 22:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonically increasing timer ID provided by this syscall. It creates
and deletes timers until the desired ID is reached. This can loop for a long
time when the checkpointed process had a very sparse timer ID range.

It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tedious due to the 32/64bit compat
issues of sigevent_t and of dubious value.

The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.

This makes it possible to address the issue with a prctl(), so that the
restorer thread can do:

   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
      goto linear_mode;
   create_timers_with_explicit_ids();
   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
   
This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.
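
A more complete userspace sketch of that sequence, for illustration only.
restore_timer_exact() is a hypothetical helper and not CRIU code. The
PR_TIMER_CREATE_RESTORE_IDS values are taken from this patch and defined
locally because installed headers may not have them yet. The sketch
assumes that any prctl() failure means "fall back to linear mode".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Values from this patch; older uapi headers do not define them */
#ifndef PR_TIMER_CREATE_RESTORE_IDS
# define PR_TIMER_CREATE_RESTORE_IDS		77
# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
#endif

/*
 * Try to create a timer with the exact ID 'wanted'.
 *
 * Returns 1 on success, 0 if the prctl() is not available and the caller
 * has to use the linear create/delete method, -1 on any other error.
 */
static int restore_timer_exact(clockid_t clock, int wanted, int *created)
{
	struct sigevent sev;
	int id = wanted;	/* In restore mode the kernel reads the ID */
	long ret;

	assert(created);

	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0))
		return 0;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_NONE;

	/* Raw syscall: the kernel's timer_t is an int, unlike glibc's */
	ret = syscall(SYS_timer_create, clock, &sev, &id);

	/* Always switch back to normal mode */
	prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0);

	if (ret)
		return -1;

	*created = id;
	return 1;
}
```

A caller would invoke restore_timer_exact(CLOCK_MONOTONIC, 1000000, &id)
for each checkpointed timer and switch to the create/delete loop as soon
as it returns 0.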

Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.

If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.

There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but contains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out.
 
As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.

Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3a: Remove the locking in the prctl() and clear restore mode on exec()
     - Frederic
V2: Move the ID counter ahead to avoid collisions after switching back to
    normal mode.
---
 include/linux/posix-timers.h |    2 
 include/linux/sched/signal.h |    1 
 include/uapi/linux/prctl.h   |   10 ++++
 kernel/sys.c                 |    5 ++
 kernel/time/posix-timers.c   |   99 +++++++++++++++++++++++++++++++------------
 5 files changed, 91 insertions(+), 26 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -114,6 +114,7 @@ bool posixtimer_init_sigqueue(struct sig
 void posixtimer_send_sigqueue(struct k_itimer *tmr);
 bool posixtimer_deliver_signal(struct kernel_siginfo *info, struct sigqueue *timer_sigq);
 void posixtimer_free_timer(struct k_itimer *timer);
+long posixtimer_create_prctl(unsigned long ctrl);
 
 /* Init task static initializer */
 #define INIT_CPU_TIMERBASE(b) {						\
@@ -140,6 +141,7 @@ static inline void posixtimer_rearm_itim
 static inline bool posixtimer_deliver_signal(struct kernel_siginfo *info,
 					     struct sigqueue *timer_sigq) { return false; }
 static inline void posixtimer_free_timer(struct k_itimer *timer) { }
+static inline long posixtimer_create_prctl(unsigned long ctrl) { return -EINVAL; }
 #endif
 
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,6 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
+	unsigned int		timer_create_restore_ids:1;
 	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,14 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS      76
 
+/*
+ * Controls the mode of timer_create() for CRIU restore operations.
+ * Enabling this allows CRIU to restore timers with explicit IDs.
+ *
+ * Don't use for normal operations as the result might be undefined.
+ */
+#define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2811,6 +2811,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			return -EINVAL;
 		error = arch_lock_shadow_stack_status(me, arg2);
 		break;
+	case PR_TIMER_CREATE_RESTORE_IDS:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = posixtimer_create_prctl(arg2);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
+#include <linux/prctl.h>
 #include <linux/sched/task.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
@@ -57,6 +58,8 @@ static const struct k_clock * const posi
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
 static const struct k_clock clock_realtime, clock_monotonic;
 
+#define TIMER_ANY_ID		INT_MIN
+
 /* SIGEV_THREAD_ID cannot share a bit with the other SIGEV values. */
 #if SIGEV_THREAD_ID != (SIGEV_THREAD_ID & \
 			~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))
@@ -128,38 +131,60 @@ static bool posix_timer_hashed(struct ti
 	return false;
 }
 
-static int posix_timer_add(struct k_itimer *timer)
+static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)
+{
+	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
+
+	scoped_guard (spinlock, &bucket->lock) {
+		/*
+		 * Validate under the lock as this could have raced against
+		 * another thread ending up with the same ID, which is
+		 * highly unlikely, but possible.
+		 */
+		if (!posix_timer_hashed(bucket, sig, id)) {
+			/*
+			 * Set the timer ID and the signal pointer to make
+			 * it identifiable in the hash table. The signal
+			 * pointer has bit 0 set to indicate that it is not
+			 * yet fully initialized. posix_timer_hashed()
+			 * masks this bit out, but the syscall lookup fails
+			 * to match due to it being set. This guarantees
+			 * that there can't be duplicate timer IDs handed
+			 * out.
+			 */
+			timer->it_id = (timer_t)id;
+			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
+			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
+			return true;
+		}
+	}
+	return false;
+}
+
+static int posix_timer_add(struct k_itimer *timer, int req_id)
 {
 	struct signal_struct *sig = current->signal;
 
+	if (unlikely(req_id != TIMER_ANY_ID)) {
+		if (!posix_timer_add_at(timer, sig, req_id))
+			return -EBUSY;
+
+		/*
+		 * Move the ID counter past the requested ID, so that after
+		 * switching back to normal mode the IDs are outside of the
+		 * exact allocated region. That avoids ID collisions on the
+		 * next regular timer_create() invocations.
+		 */
+		atomic_set(&sig->next_posix_timer_id, req_id + 1);
+		return req_id;
+	}
+
 	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
 		/* Get the next timer ID and clamp it to positive space */
 		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
-		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 
-		scoped_guard (spinlock, &bucket->lock) {
-			/*
-			 * Validate under the lock as this could have raced
-			 * against another thread ending up with the same
-			 * ID, which is highly unlikely, but possible.
-			 */
-			if (!posix_timer_hashed(bucket, sig, id)) {
-				/*
-				 * Set the timer ID and the signal pointer to make
-				 * it identifiable in the hash table. The signal
-				 * pointer has bit 0 set to indicate that it is not
-				 * yet fully initialized. posix_timer_hashed()
-				 * masks this bit out, but the syscall lookup fails
-				 * to match due to it being set. This guarantees
-				 * that there can't be duplicate timer IDs handed
-				 * out.
-				 */
-				timer->it_id = (timer_t)id;
-				timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-				hlist_add_head_rcu(&timer->t_hash, &bucket->head);
-				return id;
-			}
-		}
+		if (posix_timer_add_at(timer, sig, id))
+			return id;
 		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
@@ -364,6 +389,15 @@ static enum hrtimer_restart posix_timer_
 	return HRTIMER_NORESTART;
 }
 
+long posixtimer_create_prctl(unsigned long ctrl)
+{
+	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
+		return -EINVAL;
+
+	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;
+	return 0;
+}
+
 static struct pid *good_sigevent(sigevent_t * event)
 {
 	struct pid *pid = task_tgid(current);
@@ -435,6 +469,7 @@ static int do_timer_create(clockid_t whi
 			   timer_t __user *created_timer_id)
 {
 	const struct k_clock *kc = clockid_to_kclock(which_clock);
+	timer_t req_id = TIMER_ANY_ID;
 	struct k_itimer *new_timer;
 	int error, new_timer_id;
 
@@ -449,11 +484,20 @@ static int do_timer_create(clockid_t whi
 
 	spin_lock_init(&new_timer->it_lock);
 
+	/* Special case for CRIU to restore timers with a given timer ID. */
+	if (unlikely(current->signal->timer_create_restore_ids)) {
+		if (copy_from_user(&req_id, created_timer_id, sizeof(req_id)))
+			return -EFAULT;
+		/* Valid IDs are 0..INT_MAX */
+		if ((unsigned int)req_id > INT_MAX)
+			return -EINVAL;
+	}
+
 	/*
 	 * Add the timer to the hash table. The timer is not yet valid
 	 * after insertion, but has a unique ID allocated.
 	 */
-	new_timer_id = posix_timer_add(new_timer);
+	new_timer_id = posix_timer_add(new_timer, req_id);
 	if (new_timer_id < 0) {
 		posixtimer_free_timer(new_timer);
 		return new_timer_id;
@@ -1041,6 +1085,9 @@ void exit_itimers(struct task_struct *ts
 	struct hlist_node *next;
 	struct k_itimer *timer;
 
+	/* Clear restore mode for exec() */
+	tsk->signal->timer_create_restore_ids = 0;
+
 	if (hlist_empty(&tsk->signal->posix_timers))
 		return;
 

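The allocation policy in the patch above can be modelled in plain user space C. This is only a sketch under stated assumptions, not the kernel code: struct id_model, model_id_in_use() and model_timer_add() are made-up names, a small array stands in for the hash table, and there is no locking. It shows the two paths of posix_timer_add(): exact mode either takes the requested ID or fails and then moves the counter past it, while normal mode keeps advancing the counter until a free ID is found.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_ANY_ID	INT_MIN	/* mirrors TIMER_ANY_ID in the patch */
#define MODEL_MAX_IDS	64	/* hypothetical table size for this model */

struct id_model {
	unsigned int	next_id;		/* stands in for next_posix_timer_id */
	int		used[MODEL_MAX_IDS];	/* IDs currently handed out */
	int		nr_used;
};

static bool model_id_in_use(const struct id_model *m, int id)
{
	for (int i = 0; i < m->nr_used; i++) {
		if (m->used[i] == id)
			return true;
	}
	return false;
}

/* Returns the allocated ID, or -1 if the ID is busy or the table is full. */
static int model_timer_add(struct id_model *m, int req_id)
{
	/* Callers pass either MODEL_ANY_ID or an ID in 0..INT_MAX */
	assert(req_id == MODEL_ANY_ID || req_id >= 0);

	if (m->nr_used == MODEL_MAX_IDS)
		return -1;

	if (req_id != MODEL_ANY_ID) {
		/* Exact mode: take the requested ID or fail, never search */
		if (model_id_in_use(m, req_id))
			return -1;
		m->used[m->nr_used++] = req_id;
		/* Move the counter past the requested ID */
		m->next_id = (unsigned int)req_id + 1;
		return req_id;
	}

	/* Normal mode: advance the counter until a free ID is found */
	for (;;) {
		int id = (int)(m->next_id++ & INT_MAX);

		if (!model_id_in_use(m, id)) {
			m->used[m->nr_used++] = id;
			return id;
		}
	}
}
```

Starting from an empty model, requesting ID 8 in exact mode and then allocating in normal mode hands out ID 9, which is the behaviour the selftest in this series checks for.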
^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 22:07       ` [patch V3a " Thomas Gleixner
@ 2025-03-11 22:32         ` Frederic Weisbecker
  2025-03-12  7:56           ` Cyrill Gorcunov
  2025-03-13 11:31         ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  1 sibling, 1 reply; 68+ messages in thread
From: Frederic Weisbecker @ 2025-03-11 22:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra, Cyrill Gorcunov

Le Tue, Mar 11, 2025 at 11:07:44PM +0100, Thomas Gleixner a écrit :
> Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers
> with the same timer ID on restore. It uses sys_timer_create() and relies on
> the monotonically increasing timer ID provided by this syscall. It creates and
> deletes timers until the desired ID is reached. This can loop for a long
> time when the checkpointed process had a very sparse timer ID range.
> 
> It has been debated to implement a new syscall to allow the creation of
> timers with a given timer ID, but that's tedious due to the 32/64bit compat
> issues of sigevent_t and of dubious value.
> 
> The restore mechanism of CRIU creates the timers in a state where all
> threads of the restored process are held on a barrier and cannot issue
> syscalls. That means the restorer task has exclusive control.
> 
> This allows addressing this issue with a prctl() so that the restorer
> thread can do:
> 
>    if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
>       goto linear_mode;
>    create_timers_with_explicit_ids();
>    prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
>    
> This is backwards compatible because the prctl() fails on older kernels and
> CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
> not know about the prctl() just work as before.
> 
> Implement the prctl() and modify timer_create() so that it copies the
> requested timer ID from userspace by utilizing the existing timer_t
> pointer, which is used to copy out the allocated timer ID on success.
> 
> If the prctl() is disabled, which it is by default, timer_create() works as
> before and does not try to read from the userspace pointer.
> 
> There is no problem when a broken or rogue user space application enables
> the prctl(). If the user space pointer does not contain a valid ID, then
> timer_create() fails. If the data is not initialized, but contains a
> random valid ID, timer_create() will create that random timer ID or fail if
> the ID is already given out. 
>  
> As CRIU must use the raw syscall to avoid manipulating the internal state
> of the restored process, this has no library dependencies and can be
> adopted by CRIU right away.
> 
> Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
> the create/delete method. With the prctl() it takes 3 microseconds.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
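
For illustration, the restorer sequence from the change log, including the fallback to the linear create/delete method, could look roughly like the sketch below. The helper names raw_timer_create(), raw_timer_delete(), restore_linear() and restore_timer_id() are hypothetical and not taken from CRIU. The linear fallback assumes that nothing else in the process creates timers concurrently and that the ID counter has not yet passed the wanted ID.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef PR_TIMER_CREATE_RESTORE_IDS
# define PR_TIMER_CREATE_RESTORE_IDS		77
# define PR_TIMER_CREATE_RESTORE_IDS_OFF	 0
# define PR_TIMER_CREATE_RESTORE_IDS_ON		 1
#endif

/* Raw syscalls, so that no libc timer bookkeeping gets involved */
static int raw_timer_create(int *id)
{
	return syscall(__NR_timer_create, CLOCK_MONOTONIC, NULL, id);
}

static int raw_timer_delete(int id)
{
	return syscall(__NR_timer_delete, id);
}

/* Linear fallback: create and delete timers until the wanted ID shows up */
static int restore_linear(int want)
{
	for (;;) {
		int id = -1;

		if (raw_timer_create(&id) < 0)
			return -1;
		if (id == want)
			return 0;
		raw_timer_delete(id);
		if (id > want)
			return -1;	/* the counter already passed the wanted ID */
	}
}

/* Returns 0 when a timer with ID @want exists afterwards, -1 otherwise */
static int restore_timer_id(int want)
{
	int id = want, ret;

	/* Kernels without the prctl() fail here, so use the old method */
	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0))
		return restore_linear(want);

	/* In restore mode the kernel reads the requested ID from @id */
	ret = raw_timer_create(&id);

	prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0);
	return (ret == 0 && id == want) ? 0 : -1;
}
```

A caller would invoke restore_timer_id() once per checkpointed timer, in ascending ID order so that the linear fallback keeps working on older kernels.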

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 22:32         ` Frederic Weisbecker
@ 2025-03-12  7:56           ` Cyrill Gorcunov
  2025-03-12 11:24             ` Thomas Gleixner
  0 siblings, 1 reply; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-03-12  7:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Anna-Maria Behnsen, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Tue, Mar 11, 2025 at 11:32:58PM +0100, Frederic Weisbecker wrote:
...
> > 
> > Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
> > the create/delete method. With the prctl() it takes 3 microseconds.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

One thing which just popped into my head -- this interface may be used not
only by criu but by any application which wants to create a timer with a
specified id (hell knows why, but whatever). As far as I understand we don't
provide an interface to _read_ this property, do we? Thus criu will
incorrectly restore an application which already has this bit set.

	Cyrill

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-12  7:56           ` Cyrill Gorcunov
@ 2025-03-12 11:24             ` Thomas Gleixner
  2025-03-12 11:31               ` Thomas Gleixner
  2025-03-12 12:41               ` Cyrill Gorcunov
  0 siblings, 2 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-12 11:24 UTC (permalink / raw)
  To: Cyrill Gorcunov, Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Wed, Mar 12 2025 at 10:56, Cyrill Gorcunov wrote:
> On Tue, Mar 11, 2025 at 11:32:58PM +0100, Frederic Weisbecker wrote:
> ...
>> > 
>> > Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
>> > the create/delete method. With the prctl() it takes 3 microseconds.
>> > 
>> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> 
>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>
> One thing which just popped into my head -- this interface may be used not
> only by criu but by any application which wants to create a timer with a
> specified id (hell knows why, but whatever). As far as I understand we don't provide

Sure. Application developers are creative :)

> an interface to _read_ this property, do we? Thus criu will incorrectly
> restore an application which already has this bit set.

Delta patch below.

Thanks,

        tglx
---
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -362,5 +362,6 @@ struct prctl_mm_map {
 #define PR_TIMER_CREATE_RESTORE_IDS		77
 # define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -391,11 +391,17 @@ static enum hrtimer_restart posix_timer_
 
 long posixtimer_create_prctl(unsigned long ctrl)
 {
-	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
-		return -EINVAL;
-
-	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;
-	return 0;
+	switch (ctrl) {
+	case PR_TIMER_CREATE_RESTORE_IDS_OFF:
+		current->signal->timer_create_restore_ids = 0;
+		return 0;
+	case PR_TIMER_CREATE_RESTORE_IDS_ON:
+		current->signal->timer_create_restore_ids = 0;
+		return 0;
+	case PR_TIMER_CREATE_RESTORE_IDS_GET:
+		return current->signal->timer_create_restore_ids;
+	}
+	return -EINVAL;
 }
 
 static struct pid *good_sigevent(sigevent_t * event)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-12 11:24             ` Thomas Gleixner
@ 2025-03-12 11:31               ` Thomas Gleixner
  2025-03-12 12:41               ` Cyrill Gorcunov
  1 sibling, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-12 11:31 UTC (permalink / raw)
  To: Cyrill Gorcunov, Frederic Weisbecker
  Cc: LKML, Anna-Maria Behnsen, Benjamin Segall, Eric Dumazet,
	Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Wed, Mar 12 2025 at 12:24, Thomas Gleixner wrote:
> On Wed, Mar 12 2025 at 10:56, Cyrill Gorcunov wrote:
>> an interface to _read_ this property, do we? Thus criu will incorrectly
>> restore an application which already has this bit set.
>
> Delta patch below.

That wants a fixup for the selftest too.

---
diff --git a/tools/testing/selftests/timers/posix_timers.c b/tools/testing/selftests/timers/posix_timers.c
index 158138211f51..f0eceb0faf34 100644
--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -616,6 +616,7 @@ static int do_timer_delete(int id)
 # define PR_TIMER_CREATE_RESTORE_IDS		77
 # define PR_TIMER_CREATE_RESTORE_IDS_OFF	 0
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		 1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET	 2
 #endif
 
 static void check_timer_create_exact(void)
@@ -633,6 +634,9 @@ static void check_timer_create_exact(void)
 		}
 	}
 
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_GET, 0, 0, 0) != 1)
+		fatal_error(NULL, "prctl(GET) failed\n");
+
 	id = 8;
 	if (do_timer_create(&id) < 0)
 		fatal_error(NULL, "timer_create()");
@@ -641,7 +645,10 @@ static void check_timer_create_exact(void)
 		fatal_error(NULL, "timer_delete()");
 
 	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0))
-		fatal_error(NULL, "prctl()");
+		fatal_error(NULL, "prctl(OFF)");
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_GET, 0, 0, 0) != 0)
+		fatal_error(NULL, "prctl(GET) failed\n");
 
 	if (id != 8) {
 		ksft_test_result_fail("check timer create exact %d != 8\n", id);

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-12 11:24             ` Thomas Gleixner
  2025-03-12 11:31               ` Thomas Gleixner
@ 2025-03-12 12:41               ` Cyrill Gorcunov
  2025-03-12 17:45                 ` Thomas Gleixner
  1 sibling, 1 reply; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-03-12 12:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, LKML, Anna-Maria Behnsen, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Wed, Mar 12, 2025 at 12:24:54PM +0100, Thomas Gleixner wrote:
> +	switch (ctrl) {
> +	case PR_TIMER_CREATE_RESTORE_IDS_OFF:
> +		current->signal->timer_create_restore_ids = 0;
> +		return 0;
> +	case PR_TIMER_CREATE_RESTORE_IDS_ON:
> +		current->signal->timer_create_restore_ids = 0;

Thanks a lot, Thomas! I suspect this might be a typo; you need "= 1;" here :)

	Cyrill

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 21:35   ` Frederic Weisbecker
  2025-03-11 22:05     ` Thomas Gleixner
@ 2025-03-12 12:59     ` Cyrill Gorcunov
  1 sibling, 0 replies; 68+ messages in thread
From: Cyrill Gorcunov @ 2025-03-12 12:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Anna-Maria Behnsen, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Tue, Mar 11, 2025 at 10:35:46PM +0100, Frederic Weisbecker wrote:
> Le Sat, Mar 08, 2025 at 05:48:47PM +0100, Thomas Gleixner a écrit :
> > @@ -364,6 +389,16 @@ static enum hrtimer_restart posix_timer_
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +long posixtimer_create_prctl(unsigned long ctrl)
> > +{
> > +	if (ctrl > PR_TIMER_CREATE_RESTORE_IDS_ON)
> > +		return -EINVAL;
> > +
> > +	guard(spinlock_irq)(&current->sighand->siglock);
> > +	current->signal->timer_create_restore_ids = ctrl == PR_TIMER_CREATE_RESTORE_IDS_ON;
> 
> Is the locking necessary here? It's not used on the read side.
> It only makes sense if more flags are to be added later in struct signal and the
> fields write can race.

Actually this is a very subtle point. The @timer_create_restore_ids member is a bit field, and
updating bit fields without a lock has already led to hard-to-catch bugs in the past, especially
when neighbouring bit members such as is_child_subreaper/has_child_subreaper sit close to it.
I thought about fork(clone_vm) calls in a multithreaded application, where real_parent may
point to our task which is doing the prctl, but didn't find any problem so far (though my
gut feeling says that this is not a hot path and that it would be better to keep Thomas'
original locking code :-). Anyway, it seems to be safe without it.

	Cyrill
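
The kind of bug meant above can be replayed deterministically in user space. The sketch below is only an illustration under assumptions: struct flags_word is a hypothetical layout in which two one-bit flags share one storage unit, and the two unlocked writers are modelled as load, modify, store steps on private copies instead of real threads. It makes no claim about the actual layout of struct signal_struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: two one-bit flags which share a storage unit */
struct flags_word {
	unsigned int	is_child_subreaper:1;
	unsigned int	timer_create_restore_ids:1;
};

/*
 * Replays one bad interleaving of two unlocked writers in a single thread.
 * Each writer is modelled as load, modify, store on a private copy of the
 * shared word. Returns true if the update made by writer A survived.
 */
static bool replay_unlocked_update(struct flags_word *shared)
{
	struct flags_word a, b;

	assert(shared);

	a = *shared;			/* A loads the word */
	b = *shared;			/* B loads the same old value */

	a.is_child_subreaper = 1;
	*shared = a;			/* A stores its update */

	b.timer_create_restore_ids = 1;
	*shared = b;			/* B stores a stale copy of A's flag */

	return shared->is_child_subreaper == 1;
}
```

In this interleaving the update of writer A is lost, although the two writers never touch the same flag. Serializing both writers with a lock avoids that.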

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [patch V3a 17/18] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-12 12:41               ` Cyrill Gorcunov
@ 2025-03-12 17:45                 ` Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: Thomas Gleixner @ 2025-03-12 17:45 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Frederic Weisbecker, LKML, Anna-Maria Behnsen, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra

On Wed, Mar 12 2025 at 15:41, Cyrill Gorcunov wrote:

> On Wed, Mar 12, 2025 at 12:24:54PM +0100, Thomas Gleixner wrote:
>> +	switch (ctrl) {
>> +	case PR_TIMER_CREATE_RESTORE_IDS_OFF:
>> +		current->signal->timer_create_restore_ids = 0;
>> +		return 0;
>> +	case PR_TIMER_CREATE_RESTORE_IDS_ON:
>> +		current->signal->timer_create_restore_ids = 0;
>
> Thanks a lot, Thomas! I suspect this might be a typo; you need "= 1;" here :)

Ooops.
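
For reference, the intended semantics of the three control values can be modelled in user space as below. This is a sketch and not the kernel function: struct model_signal, the MODEL_* constants and model_create_prctl() are made-up stand-ins, and it assumes that ON is meant to set the flag to 1, as the review comment above suggests.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the PR_TIMER_CREATE_RESTORE_IDS_* values in the delta patch */
#define MODEL_RESTORE_IDS_OFF	0
#define MODEL_RESTORE_IDS_ON	1
#define MODEL_RESTORE_IDS_GET	2

/* Hypothetical stand-in for the relevant part of struct signal_struct */
struct model_signal {
	unsigned int	timer_create_restore_ids:1;
};

static long model_create_prctl(struct model_signal *sig, unsigned long ctrl)
{
	switch (ctrl) {
	case MODEL_RESTORE_IDS_OFF:
		sig->timer_create_restore_ids = 0;
		return 0;
	case MODEL_RESTORE_IDS_ON:
		/* Assumed fix: ON sets the flag instead of clearing it */
		sig->timer_create_restore_ids = 1;
		return 0;
	case MODEL_RESTORE_IDS_GET:
		/* Report the current mode without changing it */
		return sig->timer_create_restore_ids;
	}
	return -EINVAL;
}
```

With that change a GET after ON reports 1 and a GET after OFF reports 0, which is what the selftest fixup earlier in the thread expects.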

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip: timers/core] selftests/timers/posix-timers: Add a test for exact allocation mode
  2025-03-10  8:11   ` [patch V3a " Thomas Gleixner
  2025-03-11 21:44     ` Frederic Weisbecker
@ 2025-03-13 11:31     ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     8e63360d869913265e5e4b623dcd23feff9fd000
Gitweb:        https://git.kernel.org/tip/8e63360d869913265e5e4b623dcd23feff9fd000
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 10 Mar 2025 09:11:42 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:18 +01:00

selftests/timers/posix-timers: Add a test for exact allocation mode

The exact timer ID allocation mode is used by CRIU to restore timers with a
given ID. Add a test case for it.

It's skipped on older kernels when the prctl() fails.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/8734fl2tkx.ffs@tglx

---
 tools/testing/selftests/timers/posix_timers.c | 73 +++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/timers/posix_timers.c b/tools/testing/selftests/timers/posix_timers.c
index 9814b3a..f0eceb0 100644
--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -7,6 +7,7 @@
  * Kernel loop code stolen from Steven Rostedt <srostedt@redhat.com>
  */
 #define _GNU_SOURCE
+#include <sys/prctl.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <stdio.h>
@@ -599,14 +600,84 @@ static void check_overrun(int which, const char *name)
 			 "check_overrun %s\n", name);
 }
 
+#include <sys/syscall.h>
+
+static int do_timer_create(int *id)
+{
+	return syscall(__NR_timer_create, CLOCK_MONOTONIC, NULL, id);
+}
+
+static int do_timer_delete(int id)
+{
+	return syscall(__NR_timer_delete, id);
+}
+
+#ifndef PR_TIMER_CREATE_RESTORE_IDS
+# define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	 0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		 1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET	 2
+#endif
+
+static void check_timer_create_exact(void)
+{
+	int id;
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0, 0, 0)) {
+		switch (errno) {
+		case EINVAL:
+			ksft_test_result_skip("check timer create exact, not supported\n");
+			return;
+		default:
+			ksft_test_result_skip("check timer create exact, errno = %d\n", errno);
+			return;
+		}
+	}
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_GET, 0, 0, 0) != 1)
+		fatal_error(NULL, "prctl(GET) failed\n");
+
+	id = 8;
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0, 0, 0))
+		fatal_error(NULL, "prctl(OFF)");
+
+	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_GET, 0, 0, 0) != 0)
+		fatal_error(NULL, "prctl(GET) failed\n");
+
+	if (id != 8) {
+		ksft_test_result_fail("check timer create exact %d != 8\n", id);
+		return;
+	}
+
+	/* Validate that it went back to normal mode and allocates ID 9 */
+	if (do_timer_create(&id) < 0)
+		fatal_error(NULL, "timer_create()");
+
+	if (do_timer_delete(id))
+		fatal_error(NULL, "timer_delete()");
+
+	if (id == 9)
+		ksft_test_result_pass("check timer create exact\n");
+	else
+		ksft_test_result_fail("check timer create exact. Disabling failed.\n");
+}
+
 int main(int argc, char **argv)
 {
 	ksft_print_header();
-	ksft_set_plan(18);
+	ksft_set_plan(19);
 
 	ksft_print_msg("Testing posix timers. False negative may happen on CPU execution \n");
 	ksft_print_msg("based timers if other threads run on the CPU...\n");
 
+	check_timer_create_exact();
+
 	check_itimer(ITIMER_VIRTUAL, "ITIMER_VIRTUAL");
 	check_itimer(ITIMER_PROF, "ITIMER_PROF");
 	check_itimer(ITIMER_REAL, "ITIMER_REAL");

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Provide a mechanism to allocate a given timer ID
  2025-03-11 22:07       ` [patch V3a " Thomas Gleixner
  2025-03-11 22:32         ` Frederic Weisbecker
@ 2025-03-13 11:31         ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Frederic Weisbecker, Cyrill Gorcunov, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     ec2d0c04624b3c8a7eb1682e006717fa20cfbe24
Gitweb:        https://git.kernel.org/tip/ec2d0c04624b3c8a7eb1682e006717fa20cfbe24
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 11 Mar 2025 23:07:44 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:18 +01:00

posix-timers: Provide a mechanism to allocate a given timer ID

Checkpoint/Restore in Userspace (CRIU) needs to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonically increasing timer ID provided by this syscall. It creates and
deletes timers until the desired ID is reached. This can loop for a long
time when the checkpointed process had a very sparse timer ID range.

It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tedious due to the 32/64bit compat
issues of sigevent_t and of dubious value.

The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.

This allows addressing this issue with a prctl() so that the restorer
thread can do:

   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
      goto linear_mode;
   create_timers_with_explicit_ids();
   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
   
This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.

Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.

If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.

There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but contains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out. 
 
As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.

Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx

---
 include/linux/posix-timers.h |   2 +-
 include/linux/sched/signal.h |   1 +-
 include/uapi/linux/prctl.h   |  11 ++++-
 kernel/sys.c                 |   5 ++-
 kernel/time/posix-timers.c   | 105 +++++++++++++++++++++++++---------
 5 files changed, 98 insertions(+), 26 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 094ef57..dd48c64 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -114,6 +114,7 @@ bool posixtimer_init_sigqueue(struct sigqueue *q);
 void posixtimer_send_sigqueue(struct k_itimer *tmr);
 bool posixtimer_deliver_signal(struct kernel_siginfo *info, struct sigqueue *timer_sigq);
 void posixtimer_free_timer(struct k_itimer *timer);
+long posixtimer_create_prctl(unsigned long ctrl);
 
 /* Init task static initializer */
 #define INIT_CPU_TIMERBASE(b) {						\
@@ -140,6 +141,7 @@ static inline void posixtimer_rearm_itimer(struct task_struct *p) { }
 static inline bool posixtimer_deliver_signal(struct kernel_siginfo *info,
 					     struct sigqueue *timer_sigq) { return false; }
 static inline void posixtimer_free_timer(struct k_itimer *timer) { }
+static inline long posixtimer_create_prctl(unsigned long ctrl) { return -EINVAL; }
 #endif
 
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 72649d7..1ef1edb 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,6 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
+	unsigned int		timer_create_restore_ids:1;
 	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c60806..15c18ef 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,15 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS      76
 
+/*
+ * Controls the mode of timer_create() for CRIU restore operations.
+ * Enabling this allows CRIU to restore timers with explicit IDs.
+ *
+ * Don't use for normal operations as the result might be undefined.
+ */
+#define PR_TIMER_CREATE_RESTORE_IDS		77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET	2
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff..982e1c4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2811,6 +2811,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_lock_shadow_stack_status(me, arg2);
 		break;
+	case PR_TIMER_CREATE_RESTORE_IDS:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = posixtimer_create_prctl(arg2);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index b917a16..2ca1c55 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -19,6 +19,7 @@
 #include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
+#include <linux/prctl.h>
 #include <linux/sched/task.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
@@ -57,6 +58,8 @@ static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
 static const struct k_clock clock_realtime, clock_monotonic;
 
+#define TIMER_ANY_ID		INT_MIN
+
 /* SIGEV_THREAD_ID cannot share a bit with the other SIGEV values. */
 #if SIGEV_THREAD_ID != (SIGEV_THREAD_ID & \
 			~(SIGEV_SIGNAL | SIGEV_NONE | SIGEV_THREAD))
@@ -128,38 +131,60 @@ static bool posix_timer_hashed(struct timer_hash_bucket *bucket, struct signal_s
 	return false;
 }
 
-static int posix_timer_add(struct k_itimer *timer)
+static bool posix_timer_add_at(struct k_itimer *timer, struct signal_struct *sig, unsigned int id)
+{
+	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
+
+	scoped_guard (spinlock, &bucket->lock) {
+		/*
+		 * Validate under the lock as this could have raced against
+		 * another thread ending up with the same ID, which is
+		 * highly unlikely, but possible.
+		 */
+		if (!posix_timer_hashed(bucket, sig, id)) {
+			/*
+			 * Set the timer ID and the signal pointer to make
+			 * it identifiable in the hash table. The signal
+			 * pointer has bit 0 set to indicate that it is not
+			 * yet fully initialized. posix_timer_hashed()
+			 * masks this bit out, but the syscall lookup fails
+			 * to match due to it being set. This guarantees
+			 * that there can't be duplicate timer IDs handed
+			 * out.
+			 */
+			timer->it_id = (timer_t)id;
+			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
+			hlist_add_head_rcu(&timer->t_hash, &bucket->head);
+			return true;
+		}
+	}
+	return false;
+}
+
+static int posix_timer_add(struct k_itimer *timer, int req_id)
 {
 	struct signal_struct *sig = current->signal;
 
+	if (unlikely(req_id != TIMER_ANY_ID)) {
+		if (!posix_timer_add_at(timer, sig, req_id))
+			return -EBUSY;
+
+		/*
+		 * Move the ID counter past the requested ID, so that after
+		 * switching back to normal mode the IDs are outside of the
+		 * exact allocated region. That avoids ID collisions on the
+		 * next regular timer_create() invocations.
+		 */
+		atomic_set(&sig->next_posix_timer_id, req_id + 1);
+		return req_id;
+	}
+
 	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
 		/* Get the next timer ID and clamp it to positive space */
 		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
-		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 
-		scoped_guard (spinlock, &bucket->lock) {
-			/*
-			 * Validate under the lock as this could have raced
-			 * against another thread ending up with the same
-			 * ID, which is highly unlikely, but possible.
-			 */
-			if (!posix_timer_hashed(bucket, sig, id)) {
-				/*
-				 * Set the timer ID and the signal pointer to make
-				 * it identifiable in the hash table. The signal
-				 * pointer has bit 0 set to indicate that it is not
-				 * yet fully initialized. posix_timer_hashed()
-				 * masks this bit out, but the syscall lookup fails
-				 * to match due to it being set. This guarantees
-				 * that there can't be duplicate timer IDs handed
-				 * out.
-				 */
-				timer->it_id = (timer_t)id;
-				timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-				hlist_add_head_rcu(&timer->t_hash, &bucket->head);
-				return id;
-			}
-		}
+		if (posix_timer_add_at(timer, sig, id))
+			return id;
 		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
@@ -364,6 +389,21 @@ static enum hrtimer_restart posix_timer_fn(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+long posixtimer_create_prctl(unsigned long ctrl)
+{
+	switch (ctrl) {
+	case PR_TIMER_CREATE_RESTORE_IDS_OFF:
+		current->signal->timer_create_restore_ids = 0;
+		return 0;
+	case PR_TIMER_CREATE_RESTORE_IDS_ON:
+		current->signal->timer_create_restore_ids = 1;
+		return 0;
+	case PR_TIMER_CREATE_RESTORE_IDS_GET:
+		return current->signal->timer_create_restore_ids;
+	}
+	return -EINVAL;
+}
+
 static struct pid *good_sigevent(sigevent_t * event)
 {
 	struct pid *pid = task_tgid(current);
@@ -435,6 +475,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 			   timer_t __user *created_timer_id)
 {
 	const struct k_clock *kc = clockid_to_kclock(which_clock);
+	timer_t req_id = TIMER_ANY_ID;
 	struct k_itimer *new_timer;
 	int error, new_timer_id;
 
@@ -449,11 +490,20 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 
 	spin_lock_init(&new_timer->it_lock);
 
+	/* Special case for CRIU to restore timers with a given timer ID. */
+	if (unlikely(current->signal->timer_create_restore_ids)) {
+		if (copy_from_user(&req_id, created_timer_id, sizeof(req_id)))
+			return -EFAULT;
+		/* Valid IDs are 0..INT_MAX */
+		if ((unsigned int)req_id > INT_MAX)
+			return -EINVAL;
+	}
+
 	/*
 	 * Add the timer to the hash table. The timer is not yet valid
 	 * after insertion, but has a unique ID allocated.
 	 */
-	new_timer_id = posix_timer_add(new_timer);
+	new_timer_id = posix_timer_add(new_timer, req_id);
 	if (new_timer_id < 0) {
 		posixtimer_free_timer(new_timer);
 		return new_timer_id;
@@ -1041,6 +1091,9 @@ void exit_itimers(struct task_struct *tsk)
 	struct hlist_node *next;
 	struct k_itimer *timer;
 
+	/* Clear restore mode for exec() */
+	tsk->signal->timer_create_restore_ids = 0;
+
 	if (hlist_empty(&tsk->signal->posix_timers))
 		return;
 

^ permalink raw reply related	[flat|nested] 68+ messages in thread
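
From userspace, the restore mode added by the patch above could be exercised
roughly as follows. This is a minimal sketch and not part of the series: the
helper name restore_timer_with_id() is made up, and the fallback PR_* values
are assumptions which only take effect when the installed uapi headers
predate the series. The raw syscall is used because the kernel side timer_t
is a plain int, while a libc wrapper may translate timer IDs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Assumed values, only used when the uapi headers lack the definitions */
#ifndef PR_TIMER_CREATE_RESTORE_IDS
# define PR_TIMER_CREATE_RESTORE_IDS		77
# define PR_TIMER_CREATE_RESTORE_IDS_OFF	0
# define PR_TIMER_CREATE_RESTORE_IDS_ON		1
# define PR_TIMER_CREATE_RESTORE_IDS_GET	2
#endif

static_assert(sizeof(int) == 4, "the kernel timer ID is expected to be a 32-bit int");

/*
 * Hypothetical helper: try to create a timer with the kernel timer ID
 * @wanted_id. Returns the ID on success or a negative errno value.
 */
static int restore_timer_with_id(int wanted_id)
{
	struct sigevent sev = { .sigev_notify = SIGEV_NONE };
	int id = wanted_id;	/* Read by the kernel as input in restore mode */
	int err = 0;

	/* arg3..arg5 must be zero, otherwise the kernel returns -EINVAL */
	if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON, 0UL, 0UL, 0UL))
		return -errno;	/* e.g. -EINVAL on kernels without the prctl */

	if (syscall(SYS_timer_create, CLOCK_MONOTONIC, &sev, &id))
		err = -errno;	/* -EBUSY when the requested ID is already in use */

	/* Switch back to regular ID allocation in any case */
	prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF, 0UL, 0UL, 0UL);

	return err ? err : id;
}
```

A caller would treat a negative return value as "restore mode unavailable or
ID busy" and fall back to whatever mechanism it used before.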

* [tip: timers/core] posix-timers: Make per process list RCU safe
  2025-03-08 16:48 ` [patch V3 15/18] posix-timers: Make per process list RCU safe Thomas Gleixner
  2025-03-11 15:29   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     451898ea422b5861d95089d8d9c2a0ab8383775e
Gitweb:        https://git.kernel.org/tip/451898ea422b5861d95089d8d9c2a0ab8383775e
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:43 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:18 +01:00

posix-timers: Make per process list RCU safe

Preparatory change to remove the sighand locking from the /proc/$PID/timers
iterator.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.403223080@linutronix.de


---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index e4c92f4..b917a16 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -518,7 +518,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 		 * Store the unmodified signal pointer to make it valid.
 		 */
 		WRITE_ONCE(new_timer->it_signal, current->signal);
-		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
+		hlist_add_head_rcu(&new_timer->list, &current->signal->posix_timers);
 	}
 	/*
 	 * After unlocking @new_timer is subject to concurrent removal and
@@ -1004,7 +1004,7 @@ static void posix_timer_delete(struct k_itimer *timer)
 		unsigned long sig = (unsigned long)timer->it_signal | 1UL;
 
 		WRITE_ONCE(timer->it_signal, (struct signal_struct *)sig);
-		hlist_del(&timer->list);
+		hlist_del_rcu(&timer->list);
 		posix_timer_cleanup_ignored(timer);
 	}
 

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Dont iterate /proc/$PID/timers with sighand::siglock held
  2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand::siglock held Thomas Gleixner
  2025-03-08 22:38   ` Cyrill Gorcunov
  2025-03-11 15:26   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     2dc4dbf89cf186639c25c1b04a07c11496f060ad
Gitweb:        https://git.kernel.org/tip/2dc4dbf89cf186639c25c1b04a07c11496f060ad
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:45 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:18 +01:00

posix-timers: Dont iterate /proc/$PID/timers with sighand::siglock held

The readout of /proc/$PID/timers holds sighand::siglock with interrupts
disabled. That is required to protect against concurrent modifications of
the task::signal::posix_timers list because the list is not RCU safe.

With the conversion of the timer storage to an RCU protected hlist, this is
no longer required.

The only requirement is to protect the returned entry against a concurrent
free, which is trivial as the timers are RCU protected.

Removing the trylock of sighand::siglock is benign because the lifetime of
task_struct::signal is bound to the lifetime of the task_struct itself.

There are two scenarios where this matters:

  1) The process is alive and not about to be checkpointed

  2) The process is stopped via ptrace for checkpointing

#1 is a racy snapshot of the armed timers and nothing can rely on it. It's
   nothing more than debug information and it has been that way before
   because the sighand lock is dropped when the buffer is full and the
   restart of the iteration might find a completely different set of timers.

   The task and therefore task::signal cannot be freed as timers_start()
   acquired a reference count via get_pid_task().

#2 the process is stopped for checkpointing so nothing can delete or create
   timers at this point. Neither can the process exit during the traversal.

   If CRIU fails to observe an exit in progress prior to the dissemination
   of the timers, then there are more severe problems to solve in the CRIU
   mechanics as they can't rely on posix timers being enabled in the first
   place.

Therefore replace the lock acquisition with rcu_read_lock() and switch the
timer storage traversal over to seq_hlist_*_rcu().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.465175807@linutronix.de


---
 fs/proc/base.c | 48 ++++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 28 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e95..5a1d682 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2497,11 +2497,9 @@ static const struct file_operations proc_map_files_operations = {
 
 #if defined(CONFIG_CHECKPOINT_RESTORE) && defined(CONFIG_POSIX_TIMERS)
 struct timers_private {
-	struct pid *pid;
-	struct task_struct *task;
-	struct sighand_struct *sighand;
-	struct pid_namespace *ns;
-	unsigned long flags;
+	struct pid		*pid;
+	struct task_struct	*task;
+	struct pid_namespace	*ns;
 };
 
 static void *timers_start(struct seq_file *m, loff_t *pos)
@@ -2512,54 +2510,48 @@ static void *timers_start(struct seq_file *m, loff_t *pos)
 	if (!tp->task)
 		return ERR_PTR(-ESRCH);
 
-	tp->sighand = lock_task_sighand(tp->task, &tp->flags);
-	if (!tp->sighand)
-		return ERR_PTR(-ESRCH);
-
-	return seq_hlist_start(&tp->task->signal->posix_timers, *pos);
+	rcu_read_lock();
+	return seq_hlist_start_rcu(&tp->task->signal->posix_timers, *pos);
 }
 
 static void *timers_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct timers_private *tp = m->private;
-	return seq_hlist_next(v, &tp->task->signal->posix_timers, pos);
+
+	return seq_hlist_next_rcu(v, &tp->task->signal->posix_timers, pos);
 }
 
 static void timers_stop(struct seq_file *m, void *v)
 {
 	struct timers_private *tp = m->private;
 
-	if (tp->sighand) {
-		unlock_task_sighand(tp->task, &tp->flags);
-		tp->sighand = NULL;
-	}
-
 	if (tp->task) {
 		put_task_struct(tp->task);
 		tp->task = NULL;
+		rcu_read_unlock();
 	}
 }
 
 static int show_timer(struct seq_file *m, void *v)
 {
-	struct k_itimer *timer;
-	struct timers_private *tp = m->private;
-	int notify;
 	static const char * const nstr[] = {
-		[SIGEV_SIGNAL] = "signal",
-		[SIGEV_NONE] = "none",
-		[SIGEV_THREAD] = "thread",
+		[SIGEV_SIGNAL]	= "signal",
+		[SIGEV_NONE]	= "none",
+		[SIGEV_THREAD]	= "thread",
 	};
 
-	timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list);
-	notify = timer->it_sigev_notify;
+	struct k_itimer *timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list);
+	struct timers_private *tp = m->private;
+	int notify = timer->it_sigev_notify;
+
+	guard(spinlock_irq)(&timer->it_lock);
+	if (!posixtimer_valid(timer))
+		return 0;
 
 	seq_printf(m, "ID: %d\n", timer->it_id);
-	seq_printf(m, "signal: %d/%px\n",
-		   timer->sigq.info.si_signo,
+	seq_printf(m, "signal: %d/%px\n", timer->sigq.info.si_signo,
 		   timer->sigq.info.si_value.sival_ptr);
-	seq_printf(m, "notify: %s/%s.%d\n",
-		   nstr[notify & ~SIGEV_THREAD_ID],
+	seq_printf(m, "notify: %s/%s.%d\n", nstr[notify & ~SIGEV_THREAD_ID],
 		   (notify & SIGEV_THREAD_ID) ? "tid" : "pid",
 		   pid_nr_ns(timer->it_pid, tp->ns));
 	seq_printf(m, "ClockID: %d\n", timer->it_clock);

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Avoid false cacheline sharing
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
  2025-03-11 13:53   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  2025-03-13 22:13   ` [patch V3 14/18] " David Laight
  2025-03-17  6:20   ` Nysal Jan K.A.
  3 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     5fa75a432f1a6b1402edd8802ecc14f8bbb90e49
Gitweb:        https://git.kernel.org/tip/5fa75a432f1a6b1402edd8802ecc14f8bbb90e49
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:42 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:18 +01:00

posix-timers: Avoid false cacheline sharing

struct k_itimer has the hlist_node, which is used for lookup in the hash
bucket, and the timer lock in the same cache line.

That's obviously bad if one CPU fiddles with a timer while another CPU is
walking the hash bucket on which that timer is queued.

Avoid this by restructuring struct k_itimer, so that the read mostly (only
modified during setup and teardown) fields are in the first cache line and
the lock and the rest of the fields which get written to are in cacheline
2-N.

Reduces cacheline contention in a test case of 64 processes creating and
accessing 20000 timers each by almost 30% according to perf.
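
The layout idea can be illustrated outside the kernel with a small sketch.
The struct below is a made-up stand-in and not struct k_itimer: it assumes a
64 byte cache line and uses an explicit _Alignas() to push the lock to the
second line, whereas the patch relies on the field sizes plus
____cacheline_aligned_in_smp on the struct.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE	64	/* Assumed cache line size */

/* Stand-in for struct hlist_node */
struct list_node {
	struct list_node	*next, **pprev;
};

struct timer_sketch {
	/* 1st cacheline: read-mostly, written only during setup and teardown */
	struct list_node	t_hash;
	struct list_node	list;
	int			it_id;
	int			it_clock;
	void			*it_signal;

	/* 2nd cacheline and above: fields which are written regularly */
	_Alignas(CACHELINE) int	it_lock;	/* Stand-in for spinlock_t */
	int			it_status;
	long long		it_overrun;
};

/* The lookup fields must not share a cache line with the lock */
static_assert(offsetof(struct timer_sketch, it_signal) + sizeof(void *) <= CACHELINE,
	      "read-mostly fields spill into the second cache line");
static_assert(offsetof(struct timer_sketch, it_lock) == CACHELINE,
	      "lock does not start on the second cache line");
static_assert(_Alignof(struct timer_sketch) == CACHELINE,
	      "struct is not cache line aligned");
```

The alignment only helps when the allocator honours it, which is why the
patch hands __alignof__(struct k_itimer) to kmem_cache_create(). A userspace
equivalent would allocate such objects with aligned_alloc().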

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.341108067@linutronix.de


---
 include/linux/posix-timers.h | 21 ++++++++++++---------
 kernel/time/posix-timers.c   |  4 ++--
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index e714a55..094ef57 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -177,23 +177,26 @@ static inline void posix_cputimers_init_work(void) { }
  * @rcu:		RCU head for freeing the timer.
  */
 struct k_itimer {
-	struct hlist_node	list;
-	struct hlist_node	ignored_list;
+	/* 1st cacheline contains read-mostly fields */
 	struct hlist_node	t_hash;
-	spinlock_t		it_lock;
-	const struct k_clock	*kclock;
-	clockid_t		it_clock;
+	struct hlist_node	list;
 	timer_t			it_id;
+	clockid_t		it_clock;
+	int			it_sigev_notify;
+	enum pid_type		it_pid_type;
+	struct signal_struct	*it_signal;
+	const struct k_clock	*kclock;
+
+	/* 2nd cacheline and above contain fields which are modified regularly */
+	spinlock_t		it_lock;
 	int			it_status;
 	bool			it_sig_periodic;
 	s64			it_overrun;
 	s64			it_overrun_last;
 	unsigned int		it_signal_seq;
 	unsigned int		it_sigqueue_seq;
-	int			it_sigev_notify;
-	enum pid_type		it_pid_type;
 	ktime_t			it_interval;
-	struct signal_struct	*it_signal;
+	struct hlist_node	ignored_list;
 	union {
 		struct pid		*it_pid;
 		struct task_struct	*it_process;
@@ -210,7 +213,7 @@ struct k_itimer {
 		} alarm;
 	} it;
 	struct rcu_head		rcu;
-};
+} ____cacheline_aligned_in_smp;
 
 void run_posix_cpu_timers(void);
 void posix_cpu_timers_exit(struct task_struct *task);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0c4cee3..e4c92f4 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -260,8 +260,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 
 static __init int init_posix_timers(void)
 {
-	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer), 0,
-					       SLAB_ACCOUNT, NULL);
+	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer),
+					       __alignof__(struct k_itimer), SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Switch to jhash32()
  2025-03-08 16:48 ` [patch V3 13/18] posix-timers: Switch to jhash32() Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     781764e0b4394fbd8e8eb39195f8a076b60808b3
Gitweb:        https://git.kernel.org/tip/781764e0b4394fbd8e8eb39195f8a076b60808b3
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:40 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Switch to jhash32()

The hash distribution of hash_32() is suboptimal. jhash32() provides a way
better distribution, which evens out the length of the hash bucket lists,
which in turn avoids large outliers in list walk times.

Due to the sparse ID space (thanks CRIU) there is no guarantee that the
timers will be fully evenly distributed over the hash buckets, but the
behaviour is way better than with hash_32() even for randomly sparse ID
spaces.

For a pathological test case with 64 processes creating and accessing
20000 timers each, this results in a runtime reduction of ~10% and a
significantly reduced runtime variation.
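
The switch from a bit count to a mask follows from how the hash value is
reduced to a bucket index: hash_32() shifts down to the requested number of
bits, while a full 32-bit hash is simply masked with size - 1. A rough
userspace sketch of the masking variant, where mix32() is an arbitrary
stand-in mixer and explicitly not the kernel's jhash2():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary 32-bit mixer for illustration only, NOT jhash2() */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/*
 * Map an (owner pointer, timer ID) pair to a bucket index.
 * @nbuckets must be a power of two so that masking is equivalent
 * to a modulo operation.
 */
static size_t bucket_index(const void *owner, uint32_t id, size_t nbuckets)
{
	uintptr_t p = (uintptr_t)owner;
	uint32_t h;

	assert(nbuckets && (nbuckets & (nbuckets - 1)) == 0);

	/* Fold the pointer to 32 bits, combine it with the ID and mix */
	h = mix32((uint32_t)p ^ (uint32_t)((uint64_t)p >> 32) ^ mix32(id));
	return h & (nbuckets - 1);
}
```

With a power of two table size the mask is computed once at init time, which
matches the timer_hashmask = size - 1 assignment in the diff below.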

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250308155624.279080328@linutronix.de


---
 kernel/time/posix-timers.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 23f6d8b..0c4cee3 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -11,8 +11,8 @@
  */
 #include <linux/compat.h>
 #include <linux/compiler.h>
-#include <linux/hash.h>
 #include <linux/init.h>
+#include <linux/jhash.h>
 #include <linux/interrupt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
@@ -47,11 +47,11 @@ struct timer_hash_bucket {
 
 static struct {
 	struct timer_hash_bucket	*buckets;
-	unsigned long			bits;
+	unsigned long			mask;
 } __timer_data __ro_after_init __aligned(2*sizeof(long));
 
 #define timer_buckets	(__timer_data.buckets)
-#define timer_hashbits	(__timer_data.bits)
+#define timer_hashmask	(__timer_data.mask)
 
 static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
@@ -87,7 +87,7 @@ DEFINE_CLASS_IS_COND_GUARD(lock_timer);
 
 static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig, unsigned int nr)
 {
-	return &timer_buckets[hash_32(hash32_ptr(sig) ^ nr, timer_hashbits)];
+	return &timer_buckets[jhash2((u32 *)&sig, sizeof(sig) / sizeof(u32), nr) & timer_hashmask];
 }
 
 static struct k_itimer *posix_timer_by_id(timer_t id)
@@ -1513,7 +1513,7 @@ static int __init posixtimer_init(void)
 	timer_buckets = alloc_large_system_hash("posixtimers", sizeof(*timer_buckets),
 						size, 0, 0, &shift, NULL, size, size);
 	size = 1UL << shift;
-	timer_hashbits = ilog2(size);
+	timer_hashmask = size - 1;
 
 	for (i = 0; i < size; i++) {
 		spin_lock_init(&timer_buckets[i].lock);

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t
  2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t Thomas Gleixner
  2025-03-10 22:57   ` Frederic Weisbecker
  2025-03-11 13:41   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Eric Dumazet
  2 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Eric Dumazet @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     feb864ee99a2d8a22800342388401f3a3b90d42b
Gitweb:        https://git.kernel.org/tip/feb864ee99a2d8a22800342388401f3a3b90d42b
Author:        Eric Dumazet <edumazet@google.com>
AuthorDate:    Sat, 08 Mar 2025 17:48:36 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Make signal_struct::next_posix_timer_id an atomic_t

The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.

Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.

To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires handling the ID counter
locklessly.

Prepare for this by making next_posix_timer_id an atomic_t, which can be
used lockless with atomic_inc_return().
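
For illustration, the lockless counter handling can be sketched in userspace
with C11 atomics. The names below are made up and atomic_uint stands in for
the kernel's atomic_t. The clamp to the positive space mirrors the
"& INT_MAX" in the diff below, and skip_past() mirrors what the restore mode
later in the series does after an explicit ID was handed out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical per process ID allocator state */
struct id_alloc {
	atomic_uint	next_id;
};

/* Hand out the next candidate ID without taking a lock */
static unsigned int next_timer_id(struct id_alloc *a)
{
	/* Fetch the old value, increment and clamp to 0..INT_MAX */
	return atomic_fetch_add(&a->next_id, 1u) & INT_MAX;
}

/* Move the counter past an explicitly requested ID */
static void skip_past(struct id_alloc *a, unsigned int id)
{
	assert(id <= INT_MAX);
	atomic_store(&a->next_id, id + 1u);
}
```

The returned value is only a candidate. Uniqueness still has to be validated
against the hash table under the appropriate lock because the counter wraps
around.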

[ tglx: Adopted from Eric's series, massaged change log and simplified it ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de


---
 include/linux/sched/signal.h |  2 +-
 kernel/time/posix-timers.c   | 14 +++++---------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d5d03d9..72649d7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,7 +136,7 @@ struct signal_struct {
 #ifdef CONFIG_POSIX_TIMERS
 
 	/* POSIX.1b Interval Timers */
-	unsigned int		next_posix_timer_id;
+	atomic_t		next_posix_timer_id;
 	struct hlist_head	posix_timers;
 	struct hlist_head	ignored_posix_timers;
 
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 991d12a..f9a70c1 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -119,21 +119,17 @@ static bool posix_timer_hashed(struct hlist_head *head, struct signal_struct *si
 static int posix_timer_add(struct k_itimer *timer)
 {
 	struct signal_struct *sig = current->signal;
-	struct hlist_head *head;
-	unsigned int cnt, id;
 
 	/*
 	 * FIXME: Replace this by a per signal struct xarray once there is
 	 * a plan to handle the resulting CRIU regression gracefully.
 	 */
-	for (cnt = 0; cnt <= INT_MAX; cnt++) {
-		spin_lock(&hash_lock);
-		id = sig->next_posix_timer_id;
-
-		/* Write the next ID back. Clamp it to the positive space */
-		sig->next_posix_timer_id = (id + 1) & INT_MAX;
+	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
+		/* Get the next timer ID and clamp it to positive space */
+		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
+		struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
 
-		head = &posix_timers_hashtable[hash(sig, id)];
+		spin_lock(&hash_lock);
 		if (!posix_timer_hashed(head, sig, id)) {
 			/*
 			 * Set the timer ID and the signal pointer to make

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Improve hash table performance
  2025-03-08 16:48 ` [patch V3 12/18] posix-timers: Improve hash table performance Thomas Gleixner
  2025-03-11 13:44   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Benjamin Segall, Thomas Gleixner,
	Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     1535cb80286e6fbc834f075039f85274538543c7
Gitweb:        https://git.kernel.org/tip/1535cb80286e6fbc834f075039f85274538543c7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:38 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Improve hash table performance

Eric and Ben reported a significant performance bottleneck on the global
hash, which is used to store posix timers for lookup.

Eric tried to do a lockless validation of a new timer ID before trying to
insert the timer, but that does not solve the problem.

For the non-contended case this is a pointless exercise and for the
contended case this extra lookup just creates enough interleaving that all
tasks can make progress.

There are actually two real solutions to the problem:

  1) Provide a per process (signal struct) xarray storage

  2) Implement a smarter hash like the one in the futex code

#1 works perfectly fine for most cases, but the fact that CRIU enforced a
   linearly increasing timer ID to restore timers makes this problematic.

   It's easy enough to create a sparse timer ID space, which very quickly
   amounts to a large chunk of memory consumed for the xarray. 2048 timers
   with an ID offset of 512 consume more than one megabyte of memory for
   the xarray storage.

#2 The main advantage of the futex hash is that it uses per hash bucket
   locks instead of a global hash lock. Aside from that it is scaled
   according to the number of CPUs at boot time.

Experiments with artificial benchmarks have shown that a scaled hash with
per bucket locks comes pretty close to the xarray performance and in some
scenarios it performs better.

Test 1:

     A single process creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         23 ms       23 ms      9 ms     8 ms
getoverrun     14 ms       14 ms      5 ms     4 ms

Test 2:

     A single process creates 50000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create         98 ms      219 ms     20 ms    18 ms
getoverrun     62 ms       62 ms     10 ms     9 ms

Test 3:

     A single process creates 100000 timers and afterwards invokes
     timer_getoverrun(2) on each of them:

            mainline        Eric   newhash   xarray
create        313 ms      750 ms     48 ms    33 ms
getoverrun    261 ms      260 ms     20 ms    14 ms

Eric's changes create quite some overhead in the create() path due to the
double list walk, as the main issue according to perf is the list walk
itself. With 100k timers each hash bucket contains ~200 timers, which in
the worst case need to be all inspected. The same problem applies for
getoverrun() where the lookup has to walk through the hash buckets to find
the timer it is looking for.

The scaled hash obviously reduces hash collisions and lock contention
significantly. This becomes more prominent with concurrency.

Test 4:

     A process creates 63 threads and all threads wait on a barrier before
     each instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The threads are pinned on
     separate CPUs to achieve maximum concurrency. The numbers are the
     average times per thread:

            mainline        Eric   newhash   xarray
create     180239 ms    38599 ms    579 ms   813 ms
getoverrun   2645 ms     2642 ms     32 ms     7 ms

Test 5:

     A process forks 63 times and all forks wait on a barrier before each
     instance creates 20000 timers and afterwards invokes
     timer_getoverrun(2) on each of them. The processes are pinned on
     separate CPUs to achieve maximum concurrency. The numbers are the
     average times per process:

            mainline        Eric   newhash   xarray
create     157253 ms    40008 ms     83 ms    60 ms
getoverrun   2611 ms     2614 ms     40 ms     4 ms

So clearly the reduction of lock contention with Eric's changes makes a
significant difference for the create() loop, but it does not mitigate the
problem of long list walks, which is clearly visible on the getoverrun()
side because that is purely dominated by the lookup itself. Once the timer
is found, the syscall just reads from the timer structure with no other
locks or code paths involved and returns.

The reason for the difference between the thread and the fork case for the
new hash and the xarray is that both suffer from contention on
sighand::siglock and the xarray suffers additionally from contention on the
xarray lock on insertion.

The only case where the reworked hash slightly outperforms the xarray is a
tight loop which creates and deletes timers.

Test 6:

     A process creates 63 threads and all threads wait on a barrier before
     each instance runs a loop which creates and deletes a timer 100000
     times in a row. The threads are pinned on separate CPUs to achieve
     maximum concurrency. The numbers are the average times per thread:

            mainline        Eric   newhash   xarray
loop         5917 ms     5897 ms   5473 ms  7846 ms

Test 7:

     A process forks 63 times and all forks wait on a barrier before each
     instance runs a loop which creates and deletes a timer 100000
     times in a row. The processes are pinned on separate CPUs to achieve
     maximum concurrency. The numbers are the average times per process:

            mainline        Eric   newhash   xarray
loop         5137 ms     7828 ms    891 ms   872 ms

In both tests there is not much contention on the hash, but the ucount
accounting for the signal and in the thread case the sighand::siglock
contention (plus the xarray locking) contribute dominantly to the overhead.

As the memory consumption of the xarray in the sparse ID case is
significant, the scaled hash with per bucket locks seems to be the better
overall option. While the xarray has faster lookup times for a large number
of timers, the actual syscall usage, which requires the lookup, is not an
extreme hotpath. Most applications utilize signal delivery and all syscalls
except timer_getoverrun(2) are anything but cheap.

So implement a scaled hash with per bucket locks, which offers the best
tradeoff between performance and memory consumption.
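
The core of the per bucket locking scheme can be sketched in userspace as
follows. This is an illustration under simplifying assumptions and not the
kernel implementation: pthread mutexes replace the spinlocks, the table has
a fixed size instead of being scaled at boot, the hash is a trivial
placeholder, and there is no RCU protected lockless lookup. All names are
made up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS	64	/* Power of two, fixed for the sketch */

struct entry {
	struct entry	*next;
	const void	*owner;
	unsigned int	id;
};

struct bucket {
	pthread_mutex_t	lock;
	struct entry	*head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

static struct bucket *hash_bucket(const void *owner, unsigned int id)
{
	/* Trivial placeholder hash, a real table wants a proper mixer */
	return &table[(((uintptr_t)owner >> 4) ^ id) & (NBUCKETS - 1)];
}

/* Insert @e as (owner, id) unless that pair is already hashed */
static bool add_at(struct entry *e, const void *owner, unsigned int id)
{
	struct bucket *b = hash_bucket(owner, id);
	bool added = true;

	/* Only this bucket is serialized, all other buckets stay available */
	pthread_mutex_lock(&b->lock);
	for (struct entry *pos = b->head; pos; pos = pos->next) {
		if (pos->owner == owner && pos->id == id) {
			added = false;
			break;
		}
	}
	if (added) {
		e->owner = owner;
		e->id = id;
		e->next = b->head;
		b->head = e;
	}
	pthread_mutex_unlock(&b->lock);
	return added;
}
```

Validation and insertion happen under the same bucket lock, so two threads
racing for the same (owner, ID) pair cannot both succeed, while threads
which hash to different buckets do not contend at all.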

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de


---
 kernel/time/posix-timers.c |  99 ++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 31 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index f9a70c1..23f6d8b 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -12,10 +12,10 @@
 #include <linux/compat.h>
 #include <linux/compiler.h>
 #include <linux/hash.h>
-#include <linux/hashtable.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/list.h>
+#include <linux/memblock.h>
 #include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
@@ -40,8 +40,18 @@ static struct kmem_cache *posix_timers_cache;
  * This allows checkpoint/restore to reconstruct the exact timer IDs for
  * a process.
  */
-static DEFINE_HASHTABLE(posix_timers_hashtable, 9);
-static DEFINE_SPINLOCK(hash_lock);
+struct timer_hash_bucket {
+	spinlock_t		lock;
+	struct hlist_head	head;
+};
+
+static struct {
+	struct timer_hash_bucket	*buckets;
+	unsigned long			bits;
+} __timer_data __ro_after_init __aligned(2*sizeof(long));
+
+#define timer_buckets	(__timer_data.buckets)
+#define timer_hashbits	(__timer_data.bits)
 
 static const struct k_clock * const posix_clocks[];
 static const struct k_clock *clockid_to_kclock(const clockid_t id);
@@ -75,18 +85,18 @@ static inline void unlock_timer(struct k_itimer *timr)
 DEFINE_CLASS(lock_timer, struct k_itimer *, unlock_timer(_T), __lock_timer(id), timer_t id);
 DEFINE_CLASS_IS_COND_GUARD(lock_timer);
 
-static int hash(struct signal_struct *sig, unsigned int nr)
+static struct timer_hash_bucket *hash_bucket(struct signal_struct *sig, unsigned int nr)
 {
-	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
+	return &timer_buckets[hash_32(hash32_ptr(sig) ^ nr, timer_hashbits)];
 }
 
 static struct k_itimer *posix_timer_by_id(timer_t id)
 {
 	struct signal_struct *sig = current->signal;
-	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+	struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash) {
+	hlist_for_each_entry_rcu(timer, &bucket->head, t_hash) {
 		/* timer->it_signal can be set concurrently */
 		if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
 			return timer;
@@ -105,11 +115,13 @@ static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer
 	return (struct signal_struct *)(val & ~1UL);
 }
 
-static bool posix_timer_hashed(struct hlist_head *head, struct signal_struct *sig, timer_t id)
+static bool posix_timer_hashed(struct timer_hash_bucket *bucket, struct signal_struct *sig,
+			       timer_t id)
 {
+	struct hlist_head *head = &bucket->head;
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&bucket->lock)) {
 		if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
 			return true;
 	}
@@ -120,34 +132,34 @@ static int posix_timer_add(struct k_itimer *timer)
 {
 	struct signal_struct *sig = current->signal;
 
-	/*
-	 * FIXME: Replace this by a per signal struct xarray once there is
-	 * a plan to handle the resulting CRIU regression gracefully.
-	 */
 	for (unsigned int cnt = 0; cnt <= INT_MAX; cnt++) {
 		/* Get the next timer ID and clamp it to positive space */
 		unsigned int id = atomic_fetch_inc(&sig->next_posix_timer_id) & INT_MAX;
-		struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+		struct timer_hash_bucket *bucket = hash_bucket(sig, id);
 
-		spin_lock(&hash_lock);
-		if (!posix_timer_hashed(head, sig, id)) {
+		scoped_guard (spinlock, &bucket->lock) {
 			/*
-			 * Set the timer ID and the signal pointer to make
-			 * it identifiable in the hash table. The signal
-			 * pointer has bit 0 set to indicate that it is not
-			 * yet fully initialized. posix_timer_hashed()
-			 * masks this bit out, but the syscall lookup fails
-			 * to match due to it being set. This guarantees
-			 * that there can't be duplicate timer IDs handed
-			 * out.
+			 * Validate under the lock as this could have raced
+			 * against another thread ending up with the same
+			 * ID, which is highly unlikely, but possible.
 			 */
-			timer->it_id = (timer_t)id;
-			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
-			hlist_add_head_rcu(&timer->t_hash, head);
-			spin_unlock(&hash_lock);
-			return id;
+			if (!posix_timer_hashed(bucket, sig, id)) {
+				/*
+				 * Set the timer ID and the signal pointer to make
+				 * it identifiable in the hash table. The signal
+				 * pointer has bit 0 set to indicate that it is not
+				 * yet fully initialized. posix_timer_hashed()
+				 * masks this bit out, but the syscall lookup fails
+				 * to match due to it being set. This guarantees
+				 * that there can't be duplicate timer IDs handed
+				 * out.
+				 */
+				timer->it_id = (timer_t)id;
+				timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
+				hlist_add_head_rcu(&timer->t_hash, &bucket->head);
+				return id;
+			}
 		}
-		spin_unlock(&hash_lock);
 		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
@@ -405,7 +417,9 @@ void posixtimer_free_timer(struct k_itimer *tmr)
 
 static void posix_timer_unhash_and_free(struct k_itimer *tmr)
 {
-	scoped_guard (spinlock, &hash_lock)
+	struct timer_hash_bucket *bucket = hash_bucket(posix_sig_owner(tmr), tmr->it_id);
+
+	scoped_guard (spinlock, &bucket->lock)
 		hlist_del_rcu(&tmr->t_hash);
 	posixtimer_putref(tmr);
 }
@@ -1485,3 +1499,26 @@ static const struct k_clock *clockid_to_kclock(const clockid_t id)
 
 	return posix_clocks[array_index_nospec(idx, ARRAY_SIZE(posix_clocks))];
 }
+
+static int __init posixtimer_init(void)
+{
+	unsigned long i, size;
+	unsigned int shift;
+
+	if (IS_ENABLED(CONFIG_BASE_SMALL))
+		size = 512;
+	else
+		size = roundup_pow_of_two(512 * num_possible_cpus());
+
+	timer_buckets = alloc_large_system_hash("posixtimers", sizeof(*timer_buckets),
+						size, 0, 0, &shift, NULL, size, size);
+	size = 1UL << shift;
+	timer_hashbits = ilog2(size);
+
+	for (i = 0; i < size; i++) {
+		spin_lock_init(&timer_buckets[i].lock);
+		INIT_HLIST_HEAD(&timer_buckets[i].head);
+	}
+	return 0;
+}
+core_initcall(posixtimer_init);

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Make lock_timer() use guard()
  2025-03-08 16:48 ` [patch V3 10/18] posix-timers: Make lock_timer() use guard() Thomas Gleixner
  2025-03-10 11:57   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     538d710ec74233f99dc0fd604d45a2b6143c8e2c
Gitweb:        https://git.kernel.org/tip/538d710ec74233f99dc0fd604d45a2b6143c8e2c
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 08 Mar 2025 17:48:34 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Make lock_timer() use guard()

The lookup and locking of posix timers requires the same repeating pattern
at all usage sites:

   tmr = lock_timer(timer_id);
   if (!tmr)
   	return -EINVAL;
   ....
   unlock_timer(tmr);

Solve this with a guard implementation, which works out of the box in most
places, except for those which need to unlock the timer inside the guard
scope.

The only places where this matters are timer_delete() and
timer_settime(). In both cases the timer pointer needs to be preserved
across the end of the scope, which is solved by storing the pointer in a
variable outside of the scope.

timer_settime() also has to protect the timer with RCU before unlocking,
which obviously can't use guard(rcu) before leaving the guard scope as that
guard is cleaned up before the unlock. Solve this by providing the RCU
protection open coded.
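
For illustration only, and not part of the patch: a userspace sketch of the
scope based pattern, assuming the GCC/Clang __attribute__((cleanup))
extension, which the kernel's cleanup.h infrastructure builds on. The names
struct fake_timer, fake_lock_timer(), fake_unlock_timer() and TIMER_GUARD()
are hypothetical and the "lock" is a plain flag:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct k_itimer; 'locked' models it_lock. */
struct fake_timer {
	int	id;
	bool	locked;
	int	overrun;
};

static struct fake_timer the_timer = { .id = 1, .overrun = 3 };

/* Look up and lock. Returns NULL for an unknown ID, like a failed lock_timer(). */
static struct fake_timer *fake_lock_timer(int id)
{
	if (id != the_timer.id)
		return NULL;
	the_timer.locked = true;
	return &the_timer;
}

/* Destructor: tolerates NULL, so a failed lookup needs no special casing. */
static void fake_unlock_timer(struct fake_timer **tp)
{
	if (*tp)
		(*tp)->locked = false;
}

#define TIMER_GUARD(name, id)						\
	struct fake_timer *name						\
	__attribute__((cleanup(fake_unlock_timer))) = fake_lock_timer(id)

/* Simple case: the return value is evaluated before the unlock runs. */
static int fake_getoverrun(int id)
{
	TIMER_GUARD(t, id);

	if (!t)
		return -EINVAL;
	return t->overrun;
}

/* Delete case: the pointer has to survive the end of the guard scope. */
static int fake_delete(int id)
{
	struct fake_timer *timer;

	{
		TIMER_GUARD(t, id);

		if (!t)
			return -EINVAL;
		timer = t;
		timer->overrun = 0;
	}
	/* The lock is dropped here, but 'timer' is still usable. */
	timer->id = -1;
	return 0;
}
```

fake_delete() shows why the pointer is stored in a variable outside of the
scope: the guard variable itself is gone once the scope ends.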

[ tglx: Made it work and added change log ]

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net
Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de

---
 include/linux/cleanup.h    | 22 +++++----
 kernel/time/posix-timers.c | 92 ++++++++++++++-----------------------
 2 files changed, 50 insertions(+), 64 deletions(-)

diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h
index ec00e3f..a176abf 100644
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -291,11 +291,21 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \
 #define __DEFINE_CLASS_IS_CONDITIONAL(_name, _is_cond)	\
 static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
 
-#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
+#define __DEFINE_GUARD_LOCK_PTR(_name, _exp) \
+	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
+	{ return (void *)(__force unsigned long)*(_exp); }
+
+#define DEFINE_CLASS_IS_GUARD(_name) \
 	__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
+	__DEFINE_GUARD_LOCK_PTR(_name, _T)
+
+#define DEFINE_CLASS_IS_COND_GUARD(_name) \
+	__DEFINE_CLASS_IS_CONDITIONAL(_name, true); \
+	__DEFINE_GUARD_LOCK_PTR(_name, _T)
+
+#define DEFINE_GUARD(_name, _type, _lock, _unlock) \
 	DEFINE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \
-	static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
-	{ return (void *)(__force unsigned long)*_T; }
+	DEFINE_CLASS_IS_GUARD(_name)
 
 #define DEFINE_GUARD_COND(_name, _ext, _condlock) \
 	__DEFINE_CLASS_IS_CONDITIONAL(_name##_ext, true); \
@@ -375,11 +385,7 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T)	\
 	if (_T->lock) { _unlock; }					\
 }									\
 									\
-static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)	\
-{									\
-	return (void *)(__force unsigned long)_T->lock;			\
-}
-
+__DEFINE_GUARD_LOCK_PTR(_name, &_T->lock)
 
 #define __DEFINE_LOCK_GUARD_1(_name, _type, _lock)			\
 static inline class_##_name##_t class_##_name##_constructor(_type *l)	\
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dff414b..991d12a 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -63,9 +63,18 @@ static struct k_itimer *__lock_timer(timer_t timer_id);
 
 static inline void unlock_timer(struct k_itimer *timr)
 {
-	spin_unlock_irq(&timr->it_lock);
+	if (likely((timr)))
+		spin_unlock_irq(&timr->it_lock);
 }
 
+#define scoped_timer_get_or_fail(_id)					\
+	scoped_cond_guard(lock_timer, return -EINVAL, _id)
+
+#define scoped_timer				(scope)
+
+DEFINE_CLASS(lock_timer, struct k_itimer *, unlock_timer(_T), __lock_timer(id), timer_t id);
+DEFINE_CLASS_IS_COND_GUARD(lock_timer);
+
 static int hash(struct signal_struct *sig, unsigned int nr)
 {
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
@@ -682,18 +691,10 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
-	struct k_itimer *timr;
-	int ret = 0;
-
-	timr = lock_timer(timer_id);
-	if (!timr)
-		return -EINVAL;
-
 	memset(setting, 0, sizeof(*setting));
-	timr->kclock->timer_get(timr, setting);
-
-	unlock_timer(timr);
-	return ret;
+	scoped_timer_get_or_fail(timer_id)
+		scoped_timer->kclock->timer_get(scoped_timer, setting);
+	return 0;
 }
 
 /* Get the time remaining on a POSIX.1b interval timer. */
@@ -747,17 +748,8 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t, timer_id,
  */
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
-	struct k_itimer *timr;
-	int overrun;
-
-	timr = lock_timer(timer_id);
-	if (!timr)
-		return -EINVAL;
-
-	overrun = timer_overrun_to_int(timr);
-	unlock_timer(timr);
-
-	return overrun;
+	scoped_timer_get_or_fail(timer_id)
+		return timer_overrun_to_int(scoped_timer);
 }
 
 static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
@@ -875,12 +867,9 @@ int common_timer_set(struct k_itimer *timr, int flags,
 	return 0;
 }
 
-static int do_timer_settime(timer_t timer_id, int tmr_flags,
-			    struct itimerspec64 *new_spec64,
+static int do_timer_settime(timer_t timer_id, int tmr_flags, struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	int ret;
-
 	if (!timespec64_valid(&new_spec64->it_interval) ||
 	    !timespec64_valid(&new_spec64->it_value))
 		return -EINVAL;
@@ -888,36 +877,28 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	for (;;) {
-		struct k_itimer *timr = lock_timer(timer_id);
+	for (; ; old_spec64 = NULL) {
+		struct k_itimer *timr;
 
-		if (!timr)
-			return -EINVAL;
+		scoped_timer_get_or_fail(timer_id) {
+			timr = scoped_timer;
 
-		if (old_spec64)
-			old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
+			if (old_spec64)
+				old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
 
-		/* Prevent signal delivery and rearming. */
-		timr->it_signal_seq++;
+			/* Prevent signal delivery and rearming. */
+			timr->it_signal_seq++;
 
-		ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
-		if (ret != TIMER_RETRY) {
-			unlock_timer(timr);
-			break;
-		}
+			int ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+			if (ret != TIMER_RETRY)
+				return ret;
 
-		/* Read the old time only once */
-		old_spec64 = NULL;
-		/* Protect the timer from being freed after the lock is dropped */
-		guard(rcu)();
-		unlock_timer(timr);
-		/*
-		 * timer_wait_running() might drop RCU read side protection
-		 * so the timer has to be looked up again!
-		 */
+			/* Protect the timer from being freed when leaving the lock scope */
+			rcu_read_lock();
+		}
 		timer_wait_running(timr);
+		rcu_read_unlock();
 	}
-	return ret;
 }
 
 /* Set a POSIX.1b interval timer */
@@ -1028,13 +1009,12 @@ static void posix_timer_delete(struct k_itimer *timer)
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	struct k_itimer *timer = lock_timer(timer_id);
-
-	if (!timer)
-		return -EINVAL;
+	struct k_itimer *timer;
 
-	posix_timer_delete(timer);
-	unlock_timer(timer);
+	scoped_timer_get_or_fail(timer_id) {
+		timer = scoped_timer;
+		posix_timer_delete(timer);
+	}
 	/* Remove it from the hash, which frees up the timer ID */
 	posix_timer_unhash_and_free(timer);
 	return 0;

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Rework timer removal
  2025-03-10  8:13   ` [patch V3a " Thomas Gleixner
@ 2025-03-13 11:31     ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     1d25bdd3f3831bb1b9512d4b5afcd2dea8a0c515
Gitweb:        https://git.kernel.org/tip/1d25bdd3f3831bb1b9512d4b5afcd2dea8a0c515
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 10 Mar 2025 09:13:32 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Rework timer removal

sys_timer_delete() and the do_exit() cleanup function itimer_delete() are
doing the same thing, but have needlessly different implementations instead
of sharing the code.

The other oddity of timer deletion is the fact that the timer is not
invalidated before the actual deletion happens, which allows concurrent
lookups to succeed.

That's wrong because a timer which is in the process of being deleted
should not be visible and any actions like signal queueing, delivery and
rearming should not happen once the task, which invoked timer_delete(), has
the timer locked.

Rework the code so that:

   1) The signal queueing and delivery code ignore timers which are marked
      invalid

   2) The deletion implementation between sys_timer_delete() and
      itimer_delete() is shared

   3) The timer is invalidated and removed from the linked lists before
      the deletion callback of the relevant clock is invoked.

      That requires reworking timer_wait_running() as it does a lookup of
      the timer when relocking it at the end. In case of deletion this
      lookup would fail due to the preceding invalidation and the wait loop
      would terminate prematurely.

      But due to the preceding invalidation the timer cannot be accessed by
      other tasks anymore, so there is no way that the timer has been freed
      after the timer lock has been dropped.

      Move the re-validation out of timer_wait_running() and handle it at
      the only other usage site, timer_settime().
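
For illustration only, and not part of the patch: a userspace sketch of the
invalidation scheme, where bit 0 of the owner pointer marks a timer as
invalid while the owner stays recoverable by masking. struct owner,
struct fake_timer and the helper names are hypothetical, and the sketch
assumes the owner object is at least 2 byte aligned so that bit 0 is free:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct owner { int dummy; };		/* hypothetical stand-in for signal_struct */

struct fake_timer {
	struct owner *it_signal;	/* bit 0 set means "invalid" */
};

/* Same idea as posixtimer_valid(): valid as long as bit 0 is clear. */
static bool timer_valid(const struct fake_timer *t)
{
	return !((uintptr_t)t->it_signal & 1UL);
}

/* Mask the tag bit out to get the real owner back. */
static struct owner *timer_owner(const struct fake_timer *t)
{
	return (struct owner *)((uintptr_t)t->it_signal & ~(uintptr_t)1);
}

/* Mark the timer invalid without losing the owner information. */
static void timer_invalidate(struct fake_timer *t)
{
	t->it_signal = (struct owner *)((uintptr_t)t->it_signal | 1UL);
}

/* Models a lookup which compares the raw pointer: fails once tagged. */
static bool timer_lookup_matches(const struct fake_timer *t, const struct owner *cur)
{
	return t->it_signal == cur;
}
```

After timer_invalidate() a raw pointer comparison no longer matches, so
lookups fail, while timer_owner() still returns the original owner for the
final unhash step.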

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx

---
 include/linux/posix-timers.h |   7 +-
 kernel/signal.c              |   2 +-
 kernel/time/posix-timers.c   | 194 ++++++++++++++--------------------
 3 files changed, 90 insertions(+), 113 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index f11f10c..e714a55 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -240,6 +240,13 @@ static inline void posixtimer_sigqueue_putref(struct sigqueue *q)
 
 	posixtimer_putref(tmr);
 }
+
+static inline bool posixtimer_valid(const struct k_itimer *timer)
+{
+	unsigned long val = (unsigned long)timer->it_signal;
+
+	return !(val & 0x1UL);
+}
 #else  /* CONFIG_POSIX_TIMERS */
 static inline void posixtimer_sigqueue_getref(struct sigqueue *q) { }
 static inline void posixtimer_sigqueue_putref(struct sigqueue *q) { }
diff --git a/kernel/signal.c b/kernel/signal.c
index 875e97f..bb62104 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2092,7 +2092,7 @@ static inline void posixtimer_sig_ignore(struct task_struct *tsk, struct sigqueu
 	 * from a non-periodic timer, then just drop the reference
 	 * count. Otherwise queue it on the ignored list.
 	 */
-	if (tmr->it_signal && tmr->it_sig_periodic)
+	if (posixtimer_valid(tmr) && tmr->it_sig_periodic)
 		hlist_add_head(&tmr->ignored_list, &tsk->signal->ignored_posix_timers);
 	else
 		posixtimer_putref(tmr);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 4d25bea..dff414b 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -279,7 +279,7 @@ static bool __posixtimer_deliver_signal(struct kernel_siginfo *info, struct k_it
 	 * since the signal was queued. In either case, don't rearm and
 	 * drop the signal.
 	 */
-	if (timr->it_signal_seq != timr->it_sigqueue_seq || WARN_ON_ONCE(!timr->it_signal))
+	if (timr->it_signal_seq != timr->it_sigqueue_seq || WARN_ON_ONCE(!posixtimer_valid(timr)))
 		return false;
 
 	if (!timr->it_interval || WARN_ON_ONCE(timr->it_status != POSIX_TIMER_REQUEUE_PENDING))
@@ -324,6 +324,9 @@ void posix_timer_queue_signal(struct k_itimer *timr)
 {
 	lockdep_assert_held(&timr->it_lock);
 
+	if (!posixtimer_valid(timr))
+		return;
+
 	timr->it_status = timr->it_interval ? POSIX_TIMER_REQUEUE_PENDING : POSIX_TIMER_DISARMED;
 	posixtimer_send_sigqueue(timr);
 }
@@ -553,11 +556,11 @@ static struct k_itimer *__lock_timer(timer_t timer_id)
 	 * The hash lookup and the timers are RCU protected.
 	 *
 	 * Timers are added to the hash in invalid state where
-	 * timr::it_signal == NULL. timer::it_signal is only set after the
-	 * rest of the initialization succeeded.
+	 * timr::it_signal is marked invalid. timer::it_signal is only set
+	 * after the rest of the initialization succeeded.
 	 *
 	 * Timer destruction happens in steps:
-	 *  1) Set timr::it_signal to NULL with timr::it_lock held
+	 *  1) Set timr::it_signal marked invalid with timr::it_lock held
 	 *  2) Release timr::it_lock
 	 *  3) Remove from the hash under hash_lock
 	 *  4) Put the reference count.
@@ -574,8 +577,8 @@ static struct k_itimer *__lock_timer(timer_t timer_id)
 	 *
 	 * The lookup validates locklessly that timr::it_signal ==
 	 * current::it_signal and timr::it_id == @timer_id. timr::it_id
-	 * can't change, but timr::it_signal becomes NULL during
-	 * destruction.
+	 * can't change, but timr::it_signal can become invalid during
+	 * destruction, which makes the locked check fail.
 	 */
 	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
@@ -811,22 +814,13 @@ static void common_timer_wait_running(struct k_itimer *timer)
  * when the task which tries to delete or disarm the timer has preempted
  * the task which runs the expiry in task work context.
  */
-static struct k_itimer *timer_wait_running(struct k_itimer *timer)
+static void timer_wait_running(struct k_itimer *timer)
 {
-	timer_t timer_id = READ_ONCE(timer->it_id);
-
-	/* Prevent kfree(timer) after dropping the lock */
-	scoped_guard (rcu) {
-		unlock_timer(timer);
-		/*
-		 * kc->timer_wait_running() might drop RCU lock. So @timer
-		 * cannot be touched anymore after the function returns!
-		 */
-		timer->kclock->timer_wait_running(timer);
-	}
-
-	/* Relock the timer. It might be not longer hashed. */
-	return lock_timer(timer_id);
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
+	timer->kclock->timer_wait_running(timer);
 }
 
 /*
@@ -885,8 +879,7 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 			    struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	struct k_itimer *timr;
-	int error;
+	int ret;
 
 	if (!timespec64_valid(&new_spec64->it_interval) ||
 	    !timespec64_valid(&new_spec64->it_value))
@@ -895,29 +888,36 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	timr = lock_timer(timer_id);
-retry:
-	if (!timr)
-		return -EINVAL;
+	for (;;) {
+		struct k_itimer *timr = lock_timer(timer_id);
 
-	if (old_spec64)
-		old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
+		if (!timr)
+			return -EINVAL;
+
+		if (old_spec64)
+			old_spec64->it_interval = ktime_to_timespec64(timr->it_interval);
 
-	/* Prevent signal delivery and rearming. */
-	timr->it_signal_seq++;
+		/* Prevent signal delivery and rearming. */
+		timr->it_signal_seq++;
 
-	error = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		ret = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+		if (ret != TIMER_RETRY) {
+			unlock_timer(timr);
+			break;
+		}
 
-	if (error == TIMER_RETRY) {
-		// We already got the old time...
+		/* Read the old time only once */
 		old_spec64 = NULL;
-		/* Unlocks and relocks the timer if it still exists */
-		timr = timer_wait_running(timr);
-		goto retry;
+		/* Protect the timer from being freed after the lock is dropped */
+		guard(rcu)();
+		unlock_timer(timr);
+		/*
+		 * timer_wait_running() might drop RCU read side protection
+		 * so the timer has to be looked up again!
+		 */
+		timer_wait_running(timr);
 	}
-	unlock_timer(timr);
-
-	return error;
+	return ret;
 }
 
 /* Set a POSIX.1b interval timer */
@@ -988,90 +988,56 @@ static inline void posix_timer_cleanup_ignored(struct k_itimer *tmr)
 	}
 }
 
-/* Delete a POSIX.1b interval timer. */
-SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
+static void posix_timer_delete(struct k_itimer *timer)
 {
-	struct k_itimer *timer = lock_timer(timer_id);
-
-retry_delete:
-	if (!timer)
-		return -EINVAL;
-
-	/* Prevent signal delivery and rearming. */
+	/*
+	 * Invalidate the timer, remove it from the linked list and remove
+	 * it from the ignored list if pending.
+	 *
+	 * The invalidation must be written with siglock held so that the
+	 * signal code observes the invalidated timer::it_signal in
+	 * do_sigaction(), which prevents it from moving a pending signal
+	 * of a deleted timer to the ignore list.
+	 *
+	 * The invalidation also prevents signal queueing, signal delivery
+	 * and therefore rearming from the signal delivery path.
+	 *
+	 * A concurrent lookup can still find the timer in the hash, but it
+	 * will check timer::it_signal with timer::it_lock held and observe
+	 * bit 0 set, which invalidates it. That also prevents the timer ID
+	 * from being handed out before this timer is completely gone.
+	 */
 	timer->it_signal_seq++;
 
-	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
-		/* Unlocks and relocks the timer if it still exists */
-		timer = timer_wait_running(timer);
-		goto retry_delete;
-	}
-
 	scoped_guard (spinlock, &current->sighand->siglock) {
+		unsigned long sig = (unsigned long)timer->it_signal | 1UL;
+
+		WRITE_ONCE(timer->it_signal, (struct signal_struct *)sig);
 		hlist_del(&timer->list);
 		posix_timer_cleanup_ignored(timer);
-		/*
-		 * A concurrent lookup could check timer::it_signal lockless. It
-		 * will reevaluate with timer::it_lock held and observe the NULL.
-		 *
-		 * It must be written with siglock held so that the signal code
-		 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
-		 * which prevents it from moving a pending signal of a deleted
-		 * timer to the ignore list.
-		 */
-		WRITE_ONCE(timer->it_signal, NULL);
 	}
 
-	unlock_timer(timer);
-	posix_timer_unhash_and_free(timer);
-	return 0;
+	while (timer->kclock->timer_del(timer) == TIMER_RETRY) {
+		guard(rcu)();
+		spin_unlock_irq(&timer->it_lock);
+		timer_wait_running(timer);
+		spin_lock_irq(&timer->it_lock);
+	}
 }
 
-/*
- * Delete a timer if it is armed, remove it from the hash and schedule it
- * for RCU freeing.
- */
-static void itimer_delete(struct k_itimer *timer)
+/* Delete a POSIX.1b interval timer. */
+SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	spin_lock_irq(&timer->it_lock);
-
-retry_delete:
-	/*
-	 * Even if the timer is not longer accessible from other tasks
-	 * it still might be armed and queued in the underlying timer
-	 * mechanism. Worse, that timer mechanism might run the expiry
-	 * function concurrently.
-	 */
-	if (timer->kclock->timer_del(timer) == TIMER_RETRY) {
-		/*
-		 * Timer is expired concurrently, prevent livelocks
-		 * and pointless spinning on RT.
-		 *
-		 * timer_wait_running() drops timer::it_lock, which opens
-		 * the possibility for another task to delete the timer.
-		 *
-		 * That's not possible here because this is invoked from
-		 * do_exit() only for the last thread of the thread group.
-		 * So no other task can access and delete that timer.
-		 */
-		if (WARN_ON_ONCE(timer_wait_running(timer) != timer))
-			return;
-
-		goto retry_delete;
-	}
-	hlist_del(&timer->list);
-
-	posix_timer_cleanup_ignored(timer);
+	struct k_itimer *timer = lock_timer(timer_id);
 
-	/*
-	 * Setting timer::it_signal to NULL is technically not required
-	 * here as nothing can access the timer anymore legitimately via
-	 * the hash table. Set it to NULL nevertheless so that all deletion
-	 * paths are consistent.
-	 */
-	WRITE_ONCE(timer->it_signal, NULL);
+	if (!timer)
+		return -EINVAL;
 
-	spin_unlock_irq(&timer->it_lock);
+	posix_timer_delete(timer);
+	unlock_timer(timer);
+	/* Remove it from the hash, which frees up the timer ID */
 	posix_timer_unhash_and_free(timer);
+	return 0;
 }
 
 /*
@@ -1082,6 +1048,8 @@ retry_delete:
 void exit_itimers(struct task_struct *tsk)
 {
 	struct hlist_head timers;
+	struct hlist_node *next;
+	struct k_itimer *timer;
 
 	if (hlist_empty(&tsk->signal->posix_timers))
 		return;
@@ -1091,8 +1059,10 @@ void exit_itimers(struct task_struct *tsk)
 		hlist_move_list(&tsk->signal->posix_timers, &timers);
 
 	/* The timers are not longer accessible via tsk::signal */
-	while (!hlist_empty(&timers)) {
-		itimer_delete(hlist_entry(timers.first, struct k_itimer, list));
+	hlist_for_each_entry_safe(timer, next, &timers, list) {
+		scoped_guard (spinlock_irq, &timer->it_lock)
+			posix_timer_delete(timer);
+		posix_timer_unhash_and_free(timer);
 		cond_resched();
 	}
 

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [tip: timers/core] posix-timers: Simplify lock/unlock_timer()
  2025-03-08 16:48 ` [patch V3 08/18] posix-timers: Simplify lock/unlock_timer() Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     50f53b23f1e3fae071381af9a15ac1028c4efc42
Gitweb:        https://git.kernel.org/tip/50f53b23f1e3fae071381af9a15ac1028c4efc42
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:30 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Simplify lock/unlock_timer()

Since the integration of sigqueue into the timer struct, lock_timer() is
only used in task context. So taking the lock with irqsave() is no longer
required.

Convert it to use spin_[un]lock_irq().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de

---
 kernel/time/posix-timers.c | 70 +++++++++++++++----------------------
 1 file changed, 29 insertions(+), 41 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 988cbfb..4d25bea 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -53,14 +53,19 @@ static const struct k_clock clock_realtime, clock_monotonic;
 #error "SIGEV_THREAD_ID must not share bit with other SIGEV values!"
 #endif
 
-static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags);
+static struct k_itimer *__lock_timer(timer_t timer_id);
 
-#define lock_timer(tid, flags)						   \
-({	struct k_itimer *__timr;					   \
-	__cond_lock(&__timr->it_lock, __timr = __lock_timer(tid, flags));  \
-	__timr;								   \
+#define lock_timer(tid)							\
+({	struct k_itimer *__timr;					\
+	__cond_lock(&__timr->it_lock, __timr = __lock_timer(tid));	\
+	__timr;								\
 })
 
+static inline void unlock_timer(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+}
+
 static int hash(struct signal_struct *sig, unsigned int nr)
 {
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
@@ -144,11 +149,6 @@ static int posix_timer_add(struct k_itimer *timer)
 	return -EAGAIN;
 }
 
-static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
-{
-	spin_unlock_irqrestore(&timr->it_lock, flags);
-}
-
 static int posix_get_realtime_timespec(clockid_t which_clock, struct timespec64 *tp)
 {
 	ktime_get_real_ts64(tp);
@@ -538,7 +538,7 @@ COMPAT_SYSCALL_DEFINE3(timer_create, clockid_t, which_clock,
 }
 #endif
 
-static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
+static struct k_itimer *__lock_timer(timer_t timer_id)
 {
 	struct k_itimer *timr;
 
@@ -580,14 +580,14 @@ static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
 	if (timr) {
-		spin_lock_irqsave(&timr->it_lock, *flags);
+		spin_lock_irq(&timr->it_lock);
 		/*
 		 * Validate under timr::it_lock that timr::it_signal is
 		 * still valid. Pairs with #1 above.
 		 */
 		if (timr->it_signal == current->signal)
 			return timr;
-		spin_unlock_irqrestore(&timr->it_lock, *flags);
+		spin_unlock_irq(&timr->it_lock);
 	}
 	return NULL;
 }
@@ -680,17 +680,16 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int ret = 0;
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 	if (!timr)
 		return -EINVAL;
 
 	memset(setting, 0, sizeof(*setting));
 	timr->kclock->timer_get(timr, setting);
 
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 	return ret;
 }
 
@@ -746,15 +745,14 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t, timer_id,
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int overrun;
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 	if (!timr)
 		return -EINVAL;
 
 	overrun = timer_overrun_to_int(timr);
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 
 	return overrun;
 }
@@ -813,14 +811,13 @@ static void common_timer_wait_running(struct k_itimer *timer)
  * when the task which tries to delete or disarm the timer has preempted
  * the task which runs the expiry in task work context.
  */
-static struct k_itimer *timer_wait_running(struct k_itimer *timer,
-					   unsigned long *flags)
+static struct k_itimer *timer_wait_running(struct k_itimer *timer)
 {
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
 	scoped_guard (rcu) {
-		unlock_timer(timer, *flags);
+		unlock_timer(timer);
 		/*
 		 * kc->timer_wait_running() might drop RCU lock. So @timer
 		 * cannot be touched anymore after the function returns!
@@ -829,7 +826,7 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	}
 
 	/* Relock the timer. It might be not longer hashed. */
-	return lock_timer(timer_id, flags);
+	return lock_timer(timer_id);
 }
 
 /*
@@ -889,7 +886,6 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 			    struct itimerspec64 *old_spec64)
 {
 	struct k_itimer *timr;
-	unsigned long flags;
 	int error;
 
 	if (!timespec64_valid(&new_spec64->it_interval) ||
@@ -899,7 +895,7 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 	if (old_spec64)
 		memset(old_spec64, 0, sizeof(*old_spec64));
 
-	timr = lock_timer(timer_id, &flags);
+	timr = lock_timer(timer_id);
 retry:
 	if (!timr)
 		return -EINVAL;
@@ -916,10 +912,10 @@ retry:
 		// We already got the old time...
 		old_spec64 = NULL;
 		/* Unlocks and relocks the timer if it still exists */
-		timr = timer_wait_running(timr, &flags);
+		timr = timer_wait_running(timr);
 		goto retry;
 	}
-	unlock_timer(timr, flags);
+	unlock_timer(timr);
 
 	return error;
 }
@@ -995,10 +991,7 @@ static inline void posix_timer_cleanup_ignored(struct k_itimer *tmr)
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
-	struct k_itimer *timer;
-	unsigned long flags;
-
-	timer = lock_timer(timer_id, &flags);
+	struct k_itimer *timer = lock_timer(timer_id);
 
 retry_delete:
 	if (!timer)
@@ -1009,7 +1002,7 @@ retry_delete:
 
 	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
 		/* Unlocks and relocks the timer if it still exists */
-		timer = timer_wait_running(timer, &flags);
+		timer = timer_wait_running(timer);
 		goto retry_delete;
 	}
 
@@ -1028,7 +1021,7 @@ retry_delete:
 		WRITE_ONCE(timer->it_signal, NULL);
 	}
 
-	unlock_timer(timer, flags);
+	unlock_timer(timer);
 	posix_timer_unhash_and_free(timer);
 	return 0;
 }
@@ -1039,12 +1032,7 @@ retry_delete:
  */
 static void itimer_delete(struct k_itimer *timer)
 {
-	unsigned long flags;
-
-	/*
-	 * irqsave is required to make timer_wait_running() work.
-	 */
-	spin_lock_irqsave(&timer->it_lock, flags);
+	spin_lock_irq(&timer->it_lock);
 
 retry_delete:
 	/*
@@ -1065,7 +1053,7 @@ retry_delete:
 		 * do_exit() only for the last thread of the thread group.
 		 * So no other task can access and delete that timer.
 		 */
-		if (WARN_ON_ONCE(timer_wait_running(timer, &flags) != timer))
+		if (WARN_ON_ONCE(timer_wait_running(timer) != timer))
 			return;
 
 		goto retry_delete;
@@ -1082,7 +1070,7 @@ retry_delete:
 	 */
 	WRITE_ONCE(timer->it_signal, NULL);
 
-	spin_unlock_irqrestore(&timer->it_lock, flags);
+	spin_unlock_irq(&timer->it_lock);
 	posix_timer_unhash_and_free(timer);
 }
 

* [tip: timers/core] posix-timers: Use guards in a few places
  2025-03-08 16:48 ` [patch V3 07/18] posix-timers: Use guards in a few places Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     a31a300c4daba82b14eb77179b0b6fc729b9bad5
Gitweb:        https://git.kernel.org/tip/a31a300c4daba82b14eb77179b0b6fc729b9bad5
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:28 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:17 +01:00

posix-timers: Use guards in a few places

Switch locking and RCU to guards where applicable.
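
For readers unfamiliar with guards, here is a minimal userspace sketch of
the underlying idea, assuming GCC or Clang (it relies on the cleanup
variable attribute). MUTEX_GUARD, mutex_guard_release and table_store are
hypothetical names made up for this illustration, with a pthread mutex
standing in for a spinlock. This is not the kernel's guard() or
scoped_guard() implementation.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical cleanup handler: runs when the guard variable leaves scope. */
static void mutex_guard_release(pthread_mutex_t **lockp)
{
	pthread_mutex_unlock(*lockp);
}

/*
 * Hypothetical guard macro: take the lock now and release it automatically
 * at the end of the enclosing scope. Only one guard per scope, because the
 * variable name is fixed.
 */
#define MUTEX_GUARD(lock)						\
	pthread_mutex_t *guard_var					\
		__attribute__((cleanup(mutex_guard_release), unused)) =	\
		(pthread_mutex_lock(lock), (lock))

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table[4];

/* Both return paths drop the lock without an explicit unlock call. */
static int table_store(unsigned int idx, int val)
{
	MUTEX_GUARD(&table_lock);

	if (idx >= 4)
		return -1;
	table[idx] = val;
	return 0;
}
```

The benefit is the same one the patch is after: an early return cannot
leak the lock, so the error paths need no unlock boilerplate.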

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de

---
 kernel/time/posix-timers.c | 68 ++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index b7bf863..988cbfb 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -397,9 +397,8 @@ void posixtimer_free_timer(struct k_itimer *tmr)
 
 static void posix_timer_unhash_and_free(struct k_itimer *tmr)
 {
-	spin_lock(&hash_lock);
-	hlist_del_rcu(&tmr->t_hash);
-	spin_unlock(&hash_lock);
+	scoped_guard (spinlock, &hash_lock)
+		hlist_del_rcu(&tmr->t_hash);
 	posixtimer_putref(tmr);
 }
 
@@ -443,9 +442,8 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 	new_timer->it_overrun = -1LL;
 
 	if (event) {
-		rcu_read_lock();
-		new_timer->it_pid = get_pid(good_sigevent(event));
-		rcu_read_unlock();
+		scoped_guard (rcu)
+			new_timer->it_pid = get_pid(good_sigevent(event));
 		if (!new_timer->it_pid) {
 			error = -EINVAL;
 			goto out;
@@ -579,7 +577,7 @@ static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 	 * can't change, but timr::it_signal becomes NULL during
 	 * destruction.
 	 */
-	rcu_read_lock();
+	guard(rcu)();
 	timr = posix_timer_by_id(timer_id);
 	if (timr) {
 		spin_lock_irqsave(&timr->it_lock, *flags);
@@ -587,14 +585,10 @@ static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 		 * Validate under timr::it_lock that timr::it_signal is
 		 * still valid. Pairs with #1 above.
 		 */
-		if (timr->it_signal == current->signal) {
-			rcu_read_unlock();
+		if (timr->it_signal == current->signal)
 			return timr;
-		}
 		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	rcu_read_unlock();
-
 	return NULL;
 }
 
@@ -825,16 +819,15 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
-	rcu_read_lock();
-	unlock_timer(timer, *flags);
-
-	/*
-	 * kc->timer_wait_running() might drop RCU lock. So @timer
-	 * cannot be touched anymore after the function returns!
-	 */
-	timer->kclock->timer_wait_running(timer);
+	scoped_guard (rcu) {
+		unlock_timer(timer, *flags);
+		/*
+		 * kc->timer_wait_running() might drop RCU lock. So @timer
+		 * cannot be touched anymore after the function returns!
+		 */
+		timer->kclock->timer_wait_running(timer);
+	}
 
-	rcu_read_unlock();
 	/* Relock the timer. It might be not longer hashed. */
 	return lock_timer(timer_id, flags);
 }
@@ -1020,20 +1013,20 @@ retry_delete:
 		goto retry_delete;
 	}
 
-	spin_lock(&current->sighand->siglock);
-	hlist_del(&timer->list);
-	posix_timer_cleanup_ignored(timer);
-	/*
-	 * A concurrent lookup could check timer::it_signal lockless. It
-	 * will reevaluate with timer::it_lock held and observe the NULL.
-	 *
-	 * It must be written with siglock held so that the signal code
-	 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
-	 * which prevents it from moving a pending signal of a deleted
-	 * timer to the ignore list.
-	 */
-	WRITE_ONCE(timer->it_signal, NULL);
-	spin_unlock(&current->sighand->siglock);
+	scoped_guard (spinlock, &current->sighand->siglock) {
+		hlist_del(&timer->list);
+		posix_timer_cleanup_ignored(timer);
+		/*
+		 * A concurrent lookup could check timer::it_signal lockless. It
+		 * will reevaluate with timer::it_lock held and observe the NULL.
+		 *
+		 * It must be written with siglock held so that the signal code
+		 * observes timer->it_signal == NULL in do_sigaction(SIG_IGN),
+		 * which prevents it from moving a pending signal of a deleted
+		 * timer to the ignore list.
+		 */
+		WRITE_ONCE(timer->it_signal, NULL);
+	}
 
 	unlock_timer(timer, flags);
 	posix_timer_unhash_and_free(timer);
@@ -1106,9 +1099,8 @@ void exit_itimers(struct task_struct *tsk)
 		return;
 
 	/* Protect against concurrent read via /proc/$PID/timers */
-	spin_lock_irq(&tsk->sighand->siglock);
-	hlist_move_list(&tsk->signal->posix_timers, &timers);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	scoped_guard (spinlock_irq, &tsk->sighand->siglock)
+		hlist_move_list(&tsk->signal->posix_timers, &timers);
 
 	/* The timers are not longer accessible via tsk::signal */
 	while (!hlist_empty(&timers)) {

* [tip: timers/core] posix-timers: Remove SLAB_PANIC from kmem cache
  2025-03-08 16:48 ` [patch V3 06/18] posix-timers: Remove SLAB_PANIC from kmem cache Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     f6d0c3d2ebb3355dc2b2a9015563cfbae6596417
Gitweb:        https://git.kernel.org/tip/f6d0c3d2ebb3355dc2b2a9015563cfbae6596417
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:26 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Remove SLAB_PANIC from kmem cache

There is no need to panic when the posix-timer kmem_cache can't be
created. timer_create() will fail with -ENOMEM and that's it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de

---
 kernel/time/posix-timers.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 5591b15..b7bf863 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -243,9 +243,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 
 static __init int init_posix_timers(void)
 {
-	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof(struct k_itimer), 0,
-					SLAB_PANIC | SLAB_ACCOUNT, NULL);
+	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer), 0,
+					       SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
@@ -371,8 +370,12 @@ static struct pid *good_sigevent(sigevent_t * event)
 
 static struct k_itimer *alloc_posix_timer(void)
 {
-	struct k_itimer *tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
+	struct k_itimer *tmr;
 
+	if (unlikely(!posix_timers_cache))
+		return NULL;
+
+	tmr = kmem_cache_zalloc(posix_timers_cache, GFP_KERNEL);
 	if (!tmr)
 		return tmr;
 

* [tip: timers/core] posix-timers: Cleanup includes
  2025-03-08 16:48 ` [patch V3 04/18] posix-timers: Cleanup includes Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     6ad9c3380ab03c6ff3217df48a02e851e9b03db7
Gitweb:        https://git.kernel.org/tip/6ad9c3380ab03c6ff3217df48a02e851e9b03db7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:20 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Cleanup includes

Remove pointless includes and sort the remaining ones alphabetically.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de


---
 kernel/time/posix-timers.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index de25253..e908846 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -9,28 +9,22 @@
  *
  * These are all the functions necessary to implement POSIX clocks & timers
  */
-#include <linux/mm.h>
-#include <linux/interrupt.h>
-#include <linux/slab.h>
-#include <linux/time.h>
-#include <linux/mutex.h>
-#include <linux/sched/task.h>
-
-#include <linux/uaccess.h>
-#include <linux/list.h>
-#include <linux/init.h>
+#include <linux/compat.h>
 #include <linux/compiler.h>
 #include <linux/hash.h>
+#include <linux/hashtable.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/list.h>
+#include <linux/nospec.h>
 #include <linux/posix-clock.h>
 #include <linux/posix-timers.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
 #include <linux/syscalls.h>
-#include <linux/wait.h>
-#include <linux/workqueue.h>
-#include <linux/export.h>
-#include <linux/hashtable.h>
-#include <linux/compat.h>
-#include <linux/nospec.h>
+#include <linux/time.h>
 #include <linux/time_namespace.h>
+#include <linux/uaccess.h>
 
 #include "timekeeping.h"
 #include "posix-timers.h"

* [tip: timers/core] posix-timers: Remove a few paranoid warnings
  2025-03-08 16:48 ` [patch V3 05/18] posix-timers: Remove a few paranoid warnings Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, Frederic Weisbecker, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     4c5cd058beb565ea02ff3db9236f01b2b7d78071
Gitweb:        https://git.kernel.org/tip/4c5cd058beb565ea02ff3db9236f01b2b7d78071
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:24 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Remove a few paranoid warnings

Warnings about an uninitialized timer or missing callbacks are only useful
when implementing new posix clocks, and in that case a NULL pointer
dereference is expected anyway. :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de


---
 kernel/time/posix-timers.c | 37 ++++++++-----------------------------
 1 file changed, 8 insertions(+), 29 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index e908846..5591b15 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -682,7 +682,6 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 
 static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 {
-	const struct k_clock *kc;
 	struct k_itimer *timr;
 	unsigned long flags;
 	int ret = 0;
@@ -692,11 +691,7 @@ static int do_timer_gettime(timer_t timer_id,  struct itimerspec64 *setting)
 		return -EINVAL;
 
 	memset(setting, 0, sizeof(*setting));
-	kc = timr->kclock;
-	if (WARN_ON_ONCE(!kc || !kc->timer_get))
-		ret = -EINVAL;
-	else
-		kc->timer_get(timr, setting);
+	timr->kclock->timer_get(timr, setting);
 
 	unlock_timer(timr, flags);
 	return ret;
@@ -824,7 +819,6 @@ static void common_timer_wait_running(struct k_itimer *timer)
 static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 					   unsigned long *flags)
 {
-	const struct k_clock *kc = READ_ONCE(timer->kclock);
 	timer_t timer_id = READ_ONCE(timer->it_id);
 
 	/* Prevent kfree(timer) after dropping the lock */
@@ -835,8 +829,7 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	 * kc->timer_wait_running() might drop RCU lock. So @timer
 	 * cannot be touched anymore after the function returns!
 	 */
-	if (!WARN_ON_ONCE(!kc->timer_wait_running))
-		kc->timer_wait_running(timer);
+	timer->kclock->timer_wait_running(timer);
 
 	rcu_read_unlock();
 	/* Relock the timer. It might be not longer hashed. */
@@ -899,7 +892,6 @@ static int do_timer_settime(timer_t timer_id, int tmr_flags,
 			    struct itimerspec64 *new_spec64,
 			    struct itimerspec64 *old_spec64)
 {
-	const struct k_clock *kc;
 	struct k_itimer *timr;
 	unsigned long flags;
 	int error;
@@ -922,11 +914,7 @@ retry:
 	/* Prevent signal delivery and rearming. */
 	timr->it_signal_seq++;
 
-	kc = timr->kclock;
-	if (WARN_ON_ONCE(!kc || !kc->timer_set))
-		error = -EINVAL;
-	else
-		error = kc->timer_set(timr, tmr_flags, new_spec64, old_spec64);
+	error = timr->kclock->timer_set(timr, tmr_flags, new_spec64, old_spec64);
 
 	if (error == TIMER_RETRY) {
 		// We already got the old time...
@@ -1008,18 +996,6 @@ static inline void posix_timer_cleanup_ignored(struct k_itimer *tmr)
 	}
 }
 
-static inline int timer_delete_hook(struct k_itimer *timer)
-{
-	const struct k_clock *kc = timer->kclock;
-
-	/* Prevent signal delivery and rearming. */
-	timer->it_signal_seq++;
-
-	if (WARN_ON_ONCE(!kc || !kc->timer_del))
-		return -EINVAL;
-	return kc->timer_del(timer);
-}
-
 /* Delete a POSIX.1b interval timer. */
 SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 {
@@ -1032,7 +1008,10 @@ retry_delete:
 	if (!timer)
 		return -EINVAL;
 
-	if (unlikely(timer_delete_hook(timer) == TIMER_RETRY)) {
+	/* Prevent signal delivery and rearming. */
+	timer->it_signal_seq++;
+
+	if (unlikely(timer->kclock->timer_del(timer) == TIMER_RETRY)) {
 		/* Unlocks and relocks the timer if it still exists */
 		timer = timer_wait_running(timer, &flags);
 		goto retry_delete;
@@ -1078,7 +1057,7 @@ retry_delete:
 	 * mechanism. Worse, that timer mechanism might run the expiry
 	 * function concurrently.
 	 */
-	if (timer_delete_hook(timer) == TIMER_RETRY) {
+	if (timer->kclock->timer_del(timer) == TIMER_RETRY) {
 		/*
 		 * Timer is expired concurrently, prevent livelocks
 		 * and pointless spinning on RT.

* [tip: timers/core] posix-timers: Add cond_resched() to posix_timer_add() search loop
  2025-03-08 16:48 ` [patch V3 03/18] posix-timers: Add cond_resched() to posix_timer_add() search loop Thomas Gleixner
@ 2025-03-13 11:31   ` tip-bot2 for Eric Dumazet
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot2 for Eric Dumazet @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     5f2909c6cd13564a07ae692a95457f52295c4f22
Gitweb:        https://git.kernel.org/tip/5f2909c6cd13564a07ae692a95457f52295c4f22
Author:        Eric Dumazet <edumazet@google.com>
AuthorDate:    Sat, 08 Mar 2025 17:48:17 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Add cond_resched() to posix_timer_add() search loop

With a large number of POSIX timers the search for a valid ID might cause a
soft lockup on PREEMPT_NONE/VOLUNTARY kernels.

Add cond_resched() to the loop to prevent that.

[ tglx: Split out from Eric's series ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de


---
 kernel/time/posix-timers.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 24d7eab..de25253 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -144,6 +144,7 @@ static int posix_timer_add(struct k_itimer *timer)
 			return id;
 		}
 		spin_unlock(&hash_lock);
+		cond_resched();
 	}
 	/* POSIX return code when no timer ID could be allocated */
 	return -EAGAIN;

* [tip: timers/core] posix-timers: Initialise timer before adding it to the hash table
  2025-03-08 16:48 ` [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table Thomas Gleixner
  2025-03-11 13:25   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Eric Dumazet
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Eric Dumazet @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric Dumazet, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     45ece9933d4a8e0e8b3da8d1c3bbb1878be216c4
Gitweb:        https://git.kernel.org/tip/45ece9933d4a8e0e8b3da8d1c3bbb1878be216c4
Author:        Eric Dumazet <edumazet@google.com>
AuthorDate:    Sat, 08 Mar 2025 17:48:14 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Initialise timer before adding it to the hash table

A timer is only valid in the hashtable when both timer::it_signal and
timer::it_id are set to their final values, but timers are added without
those values being set.

The timer ID is allocated when the timer is added to the hash in an invalid
state. The ID is taken from a monotonically increasing per-process counter
which wraps around after reaching INT_MAX. The hash insertion validates
that no timer with the allocated ID, belonging to the same process, is
already in the hash table. That opens a mostly theoretical race condition:

If other threads of the same process manage to create/delete timers in
rapid succession before the newly created timer is fully initialized and
wrap around to the timer ID which was handed out, then a duplicate timer ID
will be inserted into the hash table.
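
The wrap-around allocation described above can be sketched in userspace
C as follows. This is an illustration only: the flat used_ids array,
MAX_TIMERS, id_in_use and alloc_id are hypothetical stand-ins for the hash
table and posix_timer_add(), and there is no locking. Only the
(id + 1) & INT_MAX step is taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_TIMERS 8			/* hypothetical tiny table */

static unsigned int next_id;		/* stands in for the per-process counter */
static unsigned int used_ids[MAX_TIMERS];
static unsigned int nr_used;

/* Linear scan instead of a hash bucket walk, to keep the sketch short. */
static bool id_in_use(unsigned int id)
{
	for (unsigned int i = 0; i < nr_used; i++) {
		if (used_ids[i] == id)
			return true;
	}
	return false;
}

/* Returns a free ID in [0, INT_MAX], or -1 when the table is full. */
static int alloc_id(void)
{
	if (nr_used == MAX_TIMERS)
		return -1;

	for (;;) {
		unsigned int id = next_id;

		/* Wrap to 0 after INT_MAX so that IDs stay non-negative. */
		next_id = (id + 1) & INT_MAX;

		if (!id_in_use(id)) {
			used_ids[nr_used++] = id;
			return (int)id;
		}
	}
}
```

The race in the change log arises because the real code performs the
"in use" check against entries whose ID and owner are not yet filled in.
In this sketch the ID is recorded at the moment it is handed out, which is
the property the patch establishes.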

Prevent this by:

  1) Setting timer::it_id before inserting the timer into the hashtable.
 
  2) Storing the signal pointer in timer::it_signal with bit 0 set before
     inserting it into the hashtable.

     Bit 0 acts as an invalid bit, which means that the regular lookup for
     sys_timer_*() will fail the comparison with the signal pointer.

     But the lookup on insertion masks out bit 0 and can therefore detect a
     timer which is not yet valid, but allocated in the hash table.  Bit 0
     in the pointer is cleared once the initialization of the timer has
     completed.
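
The bit 0 scheme can be illustrated with a small userspace sketch. All
names here (struct owner, struct timer, timer_reserve, timer_publish,
timer_matches, timer_id_taken) are hypothetical. The owner pointer is kept
as a uintptr_t to keep the tagging portable, and the sketch assumes that
struct owner is at least 2-byte aligned so that bit 0 of its address is
always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct owner { int dummy; };	/* stands in for the signal struct */

struct timer {
	uintptr_t it_owner;	/* owner address; bit 0 set: not yet valid */
	int it_id;
};

/* Reserve the ID: store ID and owner, with bit 0 marking "not yet valid". */
static void timer_reserve(struct timer *t, struct owner *o, int id)
{
	t->it_id = id;
	t->it_owner = (uintptr_t)o | 1;
}

/* Initialization done: store the untagged owner to make the timer valid. */
static void timer_publish(struct timer *t)
{
	t->it_owner &= ~(uintptr_t)1;
}

/* Regular lookup: strict compare, so a tagged entry never matches. */
static bool timer_matches(const struct timer *t, const struct owner *o, int id)
{
	return t->it_owner == (uintptr_t)o && t->it_id == id;
}

/* Insertion-time check: mask bit 0 so that reserved entries are seen too. */
static bool timer_id_taken(const struct timer *t, const struct owner *o, int id)
{
	return (t->it_owner & ~(uintptr_t)1) == (uintptr_t)o && t->it_id == id;
}
```

Between timer_reserve() and timer_publish() the entry blocks reuse of its
ID but is invisible to the regular lookup, which is the state the change
log describes for a timer that is hashed but not yet fully initialized.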

[ tglx: Fold ID and signal initialization into one patch and massage change
  log and comments. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de

---
 kernel/time/posix-timers.c | 56 +++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 6bf468b..24d7eab 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -72,13 +72,13 @@ static int hash(struct signal_struct *sig, unsigned int nr)
 	return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
 }
 
-static struct k_itimer *__posix_timers_find(struct hlist_head *head,
-					    struct signal_struct *sig,
-					    timer_t id)
+static struct k_itimer *posix_timer_by_id(timer_t id)
 {
+	struct signal_struct *sig = current->signal;
+	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
 	struct k_itimer *timer;
 
-	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+	hlist_for_each_entry_rcu(timer, head, t_hash) {
 		/* timer->it_signal can be set concurrently */
 		if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
 			return timer;
@@ -86,12 +86,26 @@ static struct k_itimer *__posix_timers_find(struct hlist_head *head,
 	return NULL;
 }
 
-static struct k_itimer *posix_timer_by_id(timer_t id)
+static inline struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
 {
-	struct signal_struct *sig = current->signal;
-	struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+	unsigned long val = (unsigned long)timer->it_signal;
+
+	/*
+	 * Mask out bit 0, which acts as invalid marker to prevent
+	 * posix_timer_by_id() detecting it as valid.
+	 */
+	return (struct signal_struct *)(val & ~1UL);
+}
+
+static bool posix_timer_hashed(struct hlist_head *head, struct signal_struct *sig, timer_t id)
+{
+	struct k_itimer *timer;
 
-	return __posix_timers_find(head, sig, id);
+	hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
+		if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
+			return true;
+	}
+	return false;
 }
 
 static int posix_timer_add(struct k_itimer *timer)
@@ -112,7 +126,19 @@ static int posix_timer_add(struct k_itimer *timer)
 		sig->next_posix_timer_id = (id + 1) & INT_MAX;
 
 		head = &posix_timers_hashtable[hash(sig, id)];
-		if (!__posix_timers_find(head, sig, id)) {
+		if (!posix_timer_hashed(head, sig, id)) {
+			/*
+			 * Set the timer ID and the signal pointer to make
+			 * it identifiable in the hash table. The signal
+			 * pointer has bit 0 set to indicate that it is not
+			 * yet fully initialized. posix_timer_hashed()
+			 * masks this bit out, but the syscall lookup fails
+			 * to match due to it being set. This guarantees
+			 * that there can't be duplicate timer IDs handed
+			 * out.
+			 */
+			timer->it_id = (timer_t)id;
+			timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
 			hlist_add_head_rcu(&timer->t_hash, head);
 			spin_unlock(&hash_lock);
 			return id;
@@ -406,8 +432,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 
 	/*
 	 * Add the timer to the hash table. The timer is not yet valid
-	 * because new_timer::it_signal is still NULL. The timer id is also
-	 * not yet visible to user space.
+	 * after insertion, but has a unique ID allocated.
 	 */
 	new_timer_id = posix_timer_add(new_timer);
 	if (new_timer_id < 0) {
@@ -415,7 +440,6 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 		return new_timer_id;
 	}
 
-	new_timer->it_id = (timer_t) new_timer_id;
 	new_timer->it_clock = which_clock;
 	new_timer->kclock = kc;
 	new_timer->it_overrun = -1LL;
@@ -453,7 +477,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 	}
 	/*
 	 * After succesful copy out, the timer ID is visible to user space
-	 * now but not yet valid because new_timer::signal is still NULL.
+	 * now but not yet valid because new_timer::signal low order bit is 1.
 	 *
 	 * Complete the initialization with the clock specific create
 	 * callback.
@@ -470,7 +494,11 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 	 */
 	scoped_guard (spinlock_irq, &new_timer->it_lock) {
 		guard(spinlock)(&current->sighand->siglock);
-		/* This makes the timer valid in the hash table */
+		/*
+		 * new_timer::it_signal contains the signal pointer with
+		 * bit 0 set, which makes it invalid for syscall operations.
+		 * Store the unmodified signal pointer to make it valid.
+		 */
 		WRITE_ONCE(new_timer->it_signal, current->signal);
 		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
 	}

* [tip: timers/core] posix-timers: Ensure that timer initialization is fully visible
  2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
  2025-03-08 21:39   ` Frederic Weisbecker
@ 2025-03-13 11:31   ` tip-bot2 for Thomas Gleixner
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-03-13 11:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     2389c6efd3ad8edb3bcce0019b4edcc7d9c7de19
Gitweb:        https://git.kernel.org/tip/2389c6efd3ad8edb3bcce0019b4edcc7d9c7de19
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 08 Mar 2025 17:48:10 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 13 Mar 2025 12:07:16 +01:00

posix-timers: Ensure that timer initialization is fully visible

Frederic pointed out that the memory operations to initialize the timer are
not guaranteed to be visible when __lock_timer() observes a valid
timer::it_signal under timer::it_lock:

  T0                                      T1
  ---------                               -----------
  do_timer_create()
      // A
      new_timer->.... = ....
      spin_lock(current->sighand)
      // B
      WRITE_ONCE(new_timer->it_signal, current->signal)
      spin_unlock(current->sighand)
                                        sys_timer_*()
                                           t = __lock_timer()
                                                  spin_lock(&timr->it_lock)
                                                  // observes B
                                                  if (timr->it_signal == current->signal)
                                                    return timr;
                                           if (!t)
                                               return;
                                        // Is not guaranteed to observe A

Protect the write of timer::it_signal, which makes the timer valid, with
timer::it_lock as well. This guarantees that T1 observes the
initialization A completely when it observes the valid signal pointer
under timer::it_lock. sighand::siglock must still be taken to protect the
signal::posix_timers list.
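
The ordering argument can be sketched in userspace C with a pthread mutex
in place of timer::it_lock. The names (struct timer, timer_setup,
timer_publish, timer_lock, timer_unlock) are hypothetical, and the sketch
leaves out the lockless hash lookup and the siglock. It shows only that
the publishing store and the validating load are done under the same lock,
so a reader that sees the valid pointer also sees the earlier
initialization.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct owner { int dummy; };	/* stands in for the signal struct */

struct timer {
	pthread_mutex_t lock;	/* plays the role of timer::it_lock */
	struct owner *owner;	/* NULL until the timer is valid */
	int clock_id;		/* example of state written in step A */
};

/* Step A: plain initialization, not yet visible as a valid timer. */
static void timer_setup(struct timer *t, int clock_id)
{
	pthread_mutex_init(&t->lock, NULL);
	t->owner = NULL;
	t->clock_id = clock_id;
}

/* Step B: publish under the lock; the unlock orders A before B. */
static void timer_publish(struct timer *t, struct owner *o)
{
	pthread_mutex_lock(&t->lock);
	t->owner = o;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Reader side: validate under the same lock. Returns the timer with the
 * lock held on success, NULL with the lock dropped otherwise. Callers
 * pass a non-NULL owner.
 */
static struct timer *timer_lock(struct timer *t, struct owner *o)
{
	pthread_mutex_lock(&t->lock);
	if (t->owner == o)
		return t;
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

static void timer_unlock(struct timer *t)
{
	pthread_mutex_unlock(&t->lock);
}
```

If timer_publish() stored the pointer under a different lock, the reader's
lock acquisition would not pair with the writer's release, which is the
gap in the diagram above.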

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de

---
 kernel/time/posix-timers.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 44ba7db..6bf468b 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -462,14 +462,21 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
 	if (error)
 		goto out;
 
-	spin_lock_irq(&current->sighand->siglock);
-	/* This makes the timer valid in the hash table */
-	WRITE_ONCE(new_timer->it_signal, current->signal);
-	hlist_add_head(&new_timer->list, &current->signal->posix_timers);
-	spin_unlock_irq(&current->sighand->siglock);
 	/*
-	 * After unlocking sighand::siglock @new_timer is subject to
-	 * concurrent removal and cannot be touched anymore
+	 * timer::it_lock ensures that __lock_timer() observes a fully
+	 * initialized timer when it observes a valid timer::it_signal.
+	 *
+	 * sighand::siglock is required to protect signal::posix_timers.
+	 */
+	scoped_guard (spinlock_irq, &new_timer->it_lock) {
+		guard(spinlock)(&current->sighand->siglock);
+		/* This makes the timer valid in the hash table */
+		WRITE_ONCE(new_timer->it_signal, current->signal);
+		hlist_add_head(&new_timer->list, &current->signal->posix_timers);
+	}
+	/*
+	 * After unlocking @new_timer is subject to concurrent removal and
+	 * cannot be touched anymore
 	 */
 	return 0;
 out:

* Re: [patch V3 14/18] posix-timers: Avoid false cacheline sharing
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
  2025-03-11 13:53   ` Frederic Weisbecker
  2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
@ 2025-03-13 22:13   ` David Laight
  2025-03-17  6:20   ` Nysal Jan K.A.
  3 siblings, 0 replies; 68+ messages in thread
From: David Laight @ 2025-03-13 22:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

On Sat,  8 Mar 2025 17:48:42 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> struct k_itimer has the hlist_node, which is used for lookup in the hash
> bucket, and the timer lock in the same cache line.
> 
> That's obviously bad, if one CPU fiddles with a timer and the other is
> walking the hash bucket on which that timer is queued.
> 
> Avoid this by restructuring struct k_itimer, so that the read mostly (only
> modified during setup and teardown) fields are in the first cache line and
> the lock and the rest of the fields which get written to are in cacheline
> 2-N.

How big is the structure?
If I count it correctly the first 'cacheline' is 64 bytes on 64bit
(and somewhat smaller on 32bit - if anyone cares).

But there are some CPUs (probably ppc) with quite large cache lines.
In that case you either need to waste the space by aligning the 2nd
part of the structure to an actual cache line, or just align the
structure to a 64 byte boundary.

	David

> 
> Reduces cacheline contention in a test case of 64 processes creating and
> accessing 20000 timers each by almost 30% according to perf.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
> V2: New patch
> ---
>  include/linux/posix-timers.h |   21 ++++++++++++---------
>  kernel/time/posix-timers.c   |    4 ++--
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -177,23 +177,26 @@ static inline void posix_cputimers_init_
>   * @rcu:		RCU head for freeing the timer.
>   */
>  struct k_itimer {
> -	struct hlist_node	list;
> -	struct hlist_node	ignored_list;
> +	/* 1st cacheline contains read-mostly fields */
>  	struct hlist_node	t_hash;
> -	spinlock_t		it_lock;
> -	const struct k_clock	*kclock;
> -	clockid_t		it_clock;
> +	struct hlist_node	list;
>  	timer_t			it_id;
> +	clockid_t		it_clock;
> +	int			it_sigev_notify;
> +	enum pid_type		it_pid_type;
> +	struct signal_struct	*it_signal;
> +	const struct k_clock	*kclock;
> +
> +	/* 2nd cacheline and above contain fields which are modified regularly */
> +	spinlock_t		it_lock;
>  	int			it_status;
>  	bool			it_sig_periodic;
>  	s64			it_overrun;
>  	s64			it_overrun_last;
>  	unsigned int		it_signal_seq;
>  	unsigned int		it_sigqueue_seq;
> -	int			it_sigev_notify;
> -	enum pid_type		it_pid_type;
>  	ktime_t			it_interval;
> -	struct signal_struct	*it_signal;
> +	struct hlist_node	ignored_list;
>  	union {
>  		struct pid		*it_pid;
>  		struct task_struct	*it_process;
> @@ -210,7 +213,7 @@ struct k_itimer {
>  		} alarm;
>  	} it;
>  	struct rcu_head		rcu;
> -};
> +} ____cacheline_aligned_in_smp;
>  
>  void run_posix_cpu_timers(void);
>  void posix_cpu_timers_exit(struct task_struct *task);
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -260,8 +260,8 @@ static int posix_get_hrtimer_res(clockid
>  
>  static __init int init_posix_timers(void)
>  {
> -	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer), 0,
> -					       SLAB_ACCOUNT, NULL);
> +	posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof(struct k_itimer),
> +					       __alignof__(struct k_itimer), SLAB_ACCOUNT, NULL);
>  	return 0;
>  }
>  __initcall(init_posix_timers);
> 
> 
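
The kmem_cache_create() hunk quoted above passes the type's alignment explicitly. An alignment attribute on a type raises its alignof and rounds its sizeof up, but it does not by itself decide where an allocator places objects. A userspace analogy, assuming C11 aligned_alloc() as a stand-in for the slab cache and a hypothetical 64-byte line size; the demo_* names are illustrative only:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CACHELINE 64	/* assumed line size for this sketch */

/* Aligning the first member raises the alignment of the whole struct. */
struct demo_obj {
	alignas(DEMO_CACHELINE) char read_mostly[64];
	char write_heavy[40];
};

/* Allocate one object while honouring the alignment the type asks for. */
static struct demo_obj *demo_alloc(void)
{
	/*
	 * sizeof is padded to a multiple of alignof, which is what
	 * aligned_alloc() expects of its size argument.
	 */
	return aligned_alloc(alignof(struct demo_obj), sizeof(struct demo_obj));
}
```

With these assumed sizes the 104 bytes of payload are padded to 128, so consecutive objects in an array or cache each start on a line boundary.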



* Re: [patch V3 14/18] posix-timers: Avoid false cacheline sharing
  2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
                     ` (2 preceding siblings ...)
  2025-03-13 22:13   ` [patch V3 14/18] " David Laight
@ 2025-03-17  6:20   ` Nysal Jan K.A.
  3 siblings, 0 replies; 68+ messages in thread
From: Nysal Jan K.A. @ 2025-03-17  6:20 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Anna-Maria Behnsen, Frederic Weisbecker, Benjamin Segall,
	Eric Dumazet, Andrey Vagin, Pavel Tikhomirov, Peter Zijlstra,
	Cyrill Gorcunov

On Sat, Mar 08, 2025 at 05:48:42PM +0100, Thomas Gleixner wrote:
> ---
> V2: New patch
> ---
>  include/linux/posix-timers.h |   21 ++++++++++++---------
>  kernel/time/posix-timers.c   |    4 ++--
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -177,23 +177,26 @@ static inline void posix_cputimers_init_
>   * @rcu:		RCU head for freeing the timer.
>   */
>  struct k_itimer {
> -	struct hlist_node	list;
> -	struct hlist_node	ignored_list;
> +	/* 1st cacheline contains read-mostly fields */
>  	struct hlist_node	t_hash;
> -	spinlock_t		it_lock;
> -	const struct k_clock	*kclock;
> -	clockid_t		it_clock;
> +	struct hlist_node	list;
>  	timer_t			it_id;
> +	clockid_t		it_clock;
> +	int			it_sigev_notify;
> +	enum pid_type		it_pid_type;
> +	struct signal_struct	*it_signal;
> +	const struct k_clock	*kclock;
> +
> +	/* 2nd cacheline and above contain fields which are modified regularly */

On architectures like powerpc, where the cache line size is 128 bytes, we might
still run into false sharing. Perhaps moving the write-heavy part towards the end
of the struct might help avoid it? Is the benchmark code public? I can collect
perf c2c data on powerpc.

> +	spinlock_t		it_lock;
>  	int			it_status;
>  	bool			it_sig_periodic;
>  	s64			it_overrun;
>  	s64			it_overrun_last;
>  	unsigned int		it_signal_seq;
>  	unsigned int		it_sigqueue_seq;
> -	int			it_sigev_notify;
> -	enum pid_type		it_pid_type;
>  	ktime_t			it_interval;
> -	struct signal_struct	*it_signal;
> +	struct hlist_node	ignored_list;
>  	union {
>  		struct pid		*it_pid;
>  		struct task_struct	*it_process;
> @@ -210,7 +213,7 @@ struct k_itimer {
>  		} alarm;
>  	} it;
>  	struct rcu_head		rcu;
> -};
> +} ____cacheline_aligned_in_smp;
>  

--Nysal
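
The concern raised in this sub-thread can be checked with plain offset arithmetic. A minimal userspace sketch, assuming 64-bit pointers modelled as uint64_t, a lock modelled as a 32-bit integer, and objects that start on a line boundary; the demo_* names and the 128-byte constant are illustrative, not kernel definitions:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BIG_LINE 128	/* assumed line size, as reported for powerpc */

/* Models struct hlist_node as two 64-bit pointers. */
struct demo_hlist_node { uint64_t next, pprev; };

/* Read-mostly fields in the order used by the patch, lock right after. */
struct demo_unpadded {
	struct demo_hlist_node t_hash;
	struct demo_hlist_node list;
	int32_t it_id;
	int32_t it_clock;
	int32_t it_sigev_notify;
	int32_t it_pid_type;
	uint64_t it_signal;
	uint64_t kclock;
	int32_t it_lock;	/* first write-heavy field, offset 64 */
};

/* Same fields, but the write-heavy part is pushed to its own big line. */
struct demo_padded {
	struct demo_hlist_node t_hash;
	struct demo_hlist_node list;
	int32_t it_id;
	int32_t it_clock;
	int32_t it_sigev_notify;
	int32_t it_pid_type;
	uint64_t it_signal;
	uint64_t kclock;
	alignas(DEMO_BIG_LINE) int32_t it_lock;
};

/* Nonzero if offsets @a and @b fall into the same line of size @line. */
static int demo_same_line(size_t a, size_t b, size_t line)
{
	return a / line == b / line;
}
```

Under these assumptions the unpadded layout separates t_hash and it_lock with 64-byte lines but not with 128-byte lines, while the padded layout separates them in both cases at the cost of 64 unused bytes per timer.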


Thread overview: 68+ messages
2025-03-08 16:48 [patch V3 00/18] posix-timers: Rework the global hash table and provide a sane mechanism for CRIU Thomas Gleixner
2025-03-08 16:48 ` [patch V3 01/18] posix-timers: Ensure that timer initialization is fully visible Thomas Gleixner
2025-03-08 21:39   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 02/18] posix-timers: Initialise timer before adding it to the hash table Thomas Gleixner
2025-03-11 13:25   ` Frederic Weisbecker
2025-03-11 14:16     ` Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
2025-03-08 16:48 ` [patch V3 03/18] posix-timers: Add cond_resched() to posix_timer_add() search loop Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
2025-03-08 16:48 ` [patch V3 04/18] posix-timers: Cleanup includes Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 05/18] posix-timers: Remove a few paranoid warnings Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 06/18] posix-timers: Remove SLAB_PANIC from kmem cache Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 07/18] posix-timers: Use guards in a few places Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 08/18] posix-timers: Simplify lock/unlock_timer() Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 09/18] posix-timers: Rework timer removal Thomas Gleixner
2025-03-09 23:17   ` Frederic Weisbecker
2025-03-10  6:33     ` Thomas Gleixner
2025-03-10  8:13   ` [patch V3a " Thomas Gleixner
2025-03-13 11:31     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 10/18] posix-timers: Make lock_timer() use guard() Thomas Gleixner
2025-03-10 11:57   ` Frederic Weisbecker
2025-03-10 17:36     ` Thomas Gleixner
2025-03-10 22:16       ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
2025-03-08 16:48 ` [patch V3 11/18] posix-timers: Make signal_struct::next_posix_timer_id an atomic_t Thomas Gleixner
2025-03-10 22:57   ` Frederic Weisbecker
2025-03-11 13:41   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Eric Dumazet
2025-03-08 16:48 ` [patch V3 12/18] posix-timers: Improve hash table performance Thomas Gleixner
2025-03-11 13:44   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 13/18] posix-timers: Switch to jhash32() Thomas Gleixner
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 14/18] posix-timers: Avoid false cacheline sharing Thomas Gleixner
2025-03-11 13:53   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-13 22:13   ` [patch V3 14/18] " David Laight
2025-03-17  6:20   ` Nysal Jan K.A.
2025-03-08 16:48 ` [patch V3 15/18] posix-timers: Make per process list RCU safe Thomas Gleixner
2025-03-11 15:29   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 16/18] posix-timers: Dont iterate /proc/$PID/timers with sighand::siglock held Thomas Gleixner
2025-03-08 22:38   ` Cyrill Gorcunov
2025-03-11 15:26   ` Frederic Weisbecker
2025-03-13 11:31   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-08 16:48 ` [patch V3 17/18] posix-timers: Provide a mechanism to allocate a given timer ID Thomas Gleixner
2025-03-08 22:25   ` Cyrill Gorcunov
2025-03-11 21:35   ` Frederic Weisbecker
2025-03-11 22:05     ` Thomas Gleixner
2025-03-11 22:07       ` [patch V3a " Thomas Gleixner
2025-03-11 22:32         ` Frederic Weisbecker
2025-03-12  7:56           ` Cyrill Gorcunov
2025-03-12 11:24             ` Thomas Gleixner
2025-03-12 11:31               ` Thomas Gleixner
2025-03-12 12:41               ` Cyrill Gorcunov
2025-03-12 17:45                 ` Thomas Gleixner
2025-03-13 11:31         ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2025-03-12 12:59     ` [patch V3 17/18] " Cyrill Gorcunov
2025-03-08 16:48 ` [patch V3 18/18] selftests/timers/posix-timers: Add a test for exact allocation mode Thomas Gleixner
2025-03-10  8:11   ` [patch V3a " Thomas Gleixner
2025-03-11 21:44     ` Frederic Weisbecker
2025-03-13 11:31     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
