* [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-19 12:55 [PATCH V2 0/4] posix-timers: Reduce spinlock contention Eric Dumazet
@ 2025-02-19 12:55 ` Eric Dumazet
2025-02-20 8:09 ` Thomas Gleixner
2025-02-19 12:55 ` [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add() Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-19 12:55 UTC (permalink / raw)
To: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
Instead of relying on a global and shared hash_lock
to protect sig->next_posix_timer_id, make it atomic.
This allows the following patch to use RCU.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/sched/signal.h | 2 +-
kernel/time/posix-timers.c | 9 +++++----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d5d03d919df8..72649d7fce2a 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -136,7 +136,7 @@ struct signal_struct {
#ifdef CONFIG_POSIX_TIMERS
/* POSIX.1b Interval Timers */
- unsigned int next_posix_timer_id;
+ atomic_t next_posix_timer_id;
struct hlist_head posix_timers;
struct hlist_head ignored_posix_timers;
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 1b675aee99a9..204a351a2fd3 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -105,13 +105,14 @@ static int posix_timer_add(struct k_itimer *timer)
* a plan to handle the resulting CRIU regression gracefully.
*/
for (cnt = 0; cnt <= INT_MAX; cnt++) {
- spin_lock(&hash_lock);
- id = sig->next_posix_timer_id;
+ id = atomic_inc_return(&sig->next_posix_timer_id) - 1;
- /* Write the next ID back. Clamp it to the positive space */
- sig->next_posix_timer_id = (id + 1) & INT_MAX;
+ /* Clamp id to the positive space */
+ id &= INT_MAX;
head = &posix_timers_hashtable[hash(sig, id)];
+
+ spin_lock(&hash_lock);
if (!__posix_timers_find(head, sig, id)) {
hlist_add_head_rcu(&timer->t_hash, head);
spin_unlock(&hash_lock);
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-19 12:55 ` [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t Eric Dumazet
@ 2025-02-20 8:09 ` Thomas Gleixner
2025-02-20 8:49 ` Eric Dumazet
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 8:09 UTC (permalink / raw)
To: Eric Dumazet, Anna-Maria Behnsen, Frederic Weisbecker
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> Instead of relying on a global and shared hash_lock
> to protect sig->next_posix_timer_id, make it atomic.
>
> This allows the following patch to use RCU.
Your patch ordering is slightly off by two :)
And it fails to explain for what RCU can be used....
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-20 8:09 ` Thomas Gleixner
@ 2025-02-20 8:49 ` Eric Dumazet
2025-02-20 14:04 ` Thomas Gleixner
0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-20 8:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20, 2025 at 9:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> > Instead of relying on a global and shared hash_lock
> > to protect sig->next_posix_timer_id, make it atomic.
> >
> > This allows the following patch to use RCU.
>
> Your patch ordering is slightly off by two :)
>
> And it fails to explain for what RCU can be used....
This is explained in the following patches.
If I add nothing in the changelog, you complain the changelog is not
explaining anything.
I suggest you write the patches. because I feel a huge resistance,
which I do not understand.
Thank you.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-20 8:49 ` Eric Dumazet
@ 2025-02-20 14:04 ` Thomas Gleixner
2025-02-20 14:32 ` Thomas Gleixner
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 14:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20 2025 at 09:49, Eric Dumazet wrote:
> On Thu, Feb 20, 2025 at 9:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > This allows the following patch to use RCU.
>>
>> Your patch ordering is slightly off by two :)
>>
>> And it fails to explain for what RCU can be used....
>
> This is explained in the following patches.
The changelog of a patch has to be self contained. The 'following patch'
has no meaning when the patch is merged.
> If I add nothing in the changelog, you complain the changelog is not
> explaining anything.
>
> I suggest you write the patches. because I feel a huge resistance,
> which I do not understand.
I'm just asking that I get properly written change logs which adhere to
the documented change log requirements.
How does that qualify as resistance?
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-20 14:04 ` Thomas Gleixner
@ 2025-02-20 14:32 ` Thomas Gleixner
2025-02-20 15:55 ` Eric Dumazet
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 14:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20 2025 at 15:04, Thomas Gleixner wrote:
> On Thu, Feb 20 2025 at 09:49, Eric Dumazet wrote:
>> On Thu, Feb 20, 2025 at 9:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > This allows the following patch to use RCU.
>>>
>>> Your patch ordering is slightly off by two :)
>>>
>>> And it fails to explain for what RCU can be used....
>>
>> This is explained in the following patches.
>
> The changelog of a patch has to be self contained. The 'following patch'
> has no meaning when the patch is merged.
That said, please just fold this into the patch which actually does this RCU
lookup upfront. The change is trivial enough that it does not really
require to be seperate. If the lockless increment would cause issues,
then the subsequent RCU lookup is the least of the worries :)
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-20 14:32 ` Thomas Gleixner
@ 2025-02-20 15:55 ` Eric Dumazet
2025-02-20 16:19 ` Thomas Gleixner
0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-20 15:55 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20, 2025 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Feb 20 2025 at 15:04, Thomas Gleixner wrote:
> > On Thu, Feb 20 2025 at 09:49, Eric Dumazet wrote:
> >> On Thu, Feb 20, 2025 at 9:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> > This allows the following patch to use RCU.
> >>>
> >>> Your patch ordering is slightly off by two :)
> >>>
> >>> And it fails to explain for what RCU can be used....
> >>
> >> This is explained in the following patches.
> >
> > The changelog of a patch has to be self contained. The 'following patch'
> > has no meaning when the patch is merged.
>
> That said, please just fold this into the patch which actually does this RCU
> lookup upfront. The change is trivial enough that it does not really
> require to be seperate. If the lockless increment would cause issues,
> then the subsequent RCU lookup is the least of the worries :)
I can squash all patches into a single one if you prefer.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t
2025-02-20 15:55 ` Eric Dumazet
@ 2025-02-20 16:19 ` Thomas Gleixner
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 16:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20 2025 at 16:55, Eric Dumazet wrote:
> On Thu, Feb 20, 2025 at 3:32 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Thu, Feb 20 2025 at 15:04, Thomas Gleixner wrote:
>> > On Thu, Feb 20 2025 at 09:49, Eric Dumazet wrote:
>> >> On Thu, Feb 20, 2025 at 9:09 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >>> > This allows the following patch to use RCU.
>> >>>
>> >>> Your patch ordering is slightly off by two :)
>> >>>
>> >>> And it fails to explain for what RCU can be used....
>> >>
>> >> This is explained in the following patches.
>> >
>> > The changelog of a patch has to be self contained. The 'following patch'
>> > has no meaning when the patch is merged.
>>
>> That said, please just fold this into the patch which actually does this RCU
>> lookup upfront. The change is trivial enough that it does not really
>> require to be seperate. If the lockless increment would cause issues,
>> then the subsequent RCU lookup is the least of the worries :)
>
> I can squash all patches into a single one if you prefer.
Please don't. The wraparound race prevention has nothing to do with the
lookup optimization.
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add()
2025-02-19 12:55 [PATCH V2 0/4] posix-timers: Reduce spinlock contention Eric Dumazet
2025-02-19 12:55 ` [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t Eric Dumazet
@ 2025-02-19 12:55 ` Eric Dumazet
2025-02-20 8:12 ` Thomas Gleixner
2025-02-19 12:55 ` [PATCH V2 3/4] posix-timers: Initialise timer->it_signal " Eric Dumazet
2025-02-19 12:55 ` [PATCH V2 4/4] posix-timers: Use RCU " Eric Dumazet
3 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-19 12:55 UTC (permalink / raw)
To: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
A timer is visible only when both timer->signal and timer->i_id
are set to their final values.
We can initialize timer->it_id sooner.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
kernel/time/posix-timers.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 204a351a2fd3..1f73ea955756 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -114,6 +114,7 @@ static int posix_timer_add(struct k_itimer *timer)
spin_lock(&hash_lock);
if (!__posix_timers_find(head, sig, id)) {
+ timer->it_id = (timer_t)id;
hlist_add_head_rcu(&timer->t_hash, head);
spin_unlock(&hash_lock);
return id;
@@ -407,8 +408,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
/*
* Add the timer to the hash table. The timer is not yet valid
- * because new_timer::it_signal is still NULL. The timer id is also
- * not yet visible to user space.
+ * because new_timer::it_signal is still NULL.
*/
new_timer_id = posix_timer_add(new_timer);
if (new_timer_id < 0) {
@@ -416,7 +416,6 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
return new_timer_id;
}
- new_timer->it_id = (timer_t) new_timer_id;
new_timer->it_clock = which_clock;
new_timer->kclock = kc;
new_timer->it_overrun = -1LL;
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add()
2025-02-19 12:55 ` [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add() Eric Dumazet
@ 2025-02-20 8:12 ` Thomas Gleixner
2025-02-20 8:48 ` Eric Dumazet
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 8:12 UTC (permalink / raw)
To: Eric Dumazet, Anna-Maria Behnsen, Frederic Weisbecker
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> A timer is visible only when both timer->signal and timer->i_id
> are set to their final values.
>
> We can initialize timer->it_id sooner.
This changelog tells nothing. What is this fixing, why is it required to
initialize this sooner?
We can... We also can not ... Aside of that please write changelogs in
imperative mood as Documentation asks for.
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> kernel/time/posix-timers.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index 204a351a2fd3..1f73ea955756 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -114,6 +114,7 @@ static int posix_timer_add(struct k_itimer *timer)
>
> spin_lock(&hash_lock);
> if (!__posix_timers_find(head, sig, id)) {
> + timer->it_id = (timer_t)id;
> hlist_add_head_rcu(&timer->t_hash, head);
> spin_unlock(&hash_lock);
> return id;
> @@ -407,8 +408,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
>
> /*
> * Add the timer to the hash table. The timer is not yet valid
> - * because new_timer::it_signal is still NULL. The timer id is also
> - * not yet visible to user space.
Even with this change the timer ID is not yet visible to user space
because it has not yet been copied back ....
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add()
2025-02-20 8:12 ` Thomas Gleixner
@ 2025-02-20 8:48 ` Eric Dumazet
2025-02-20 14:10 ` Thomas Gleixner
0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-20 8:48 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20, 2025 at 9:12 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> > A timer is visible only when both timer->signal and timer->i_id
> > are set to their final values.
> >
> > We can initialize timer->it_id sooner.
>
> This changelog tells nothing. What is this fixing, why is it required to
> initialize this sooner?
Just to make the series more readable and bisectable.
>
> We can... We also can not ... Aside of that please write changelogs in
> imperative mood as Documentation asks for.
I do not understand.
I wrote more than 5000 upstream linux patches, you are trying to
educate me in 2025 ?
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > kernel/time/posix-timers.c | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> > index 204a351a2fd3..1f73ea955756 100644
> > --- a/kernel/time/posix-timers.c
> > +++ b/kernel/time/posix-timers.c
> > @@ -114,6 +114,7 @@ static int posix_timer_add(struct k_itimer *timer)
> >
> > spin_lock(&hash_lock);
> > if (!__posix_timers_find(head, sig, id)) {
> > + timer->it_id = (timer_t)id;
> > hlist_add_head_rcu(&timer->t_hash, head);
> > spin_unlock(&hash_lock);
> > return id;
> > @@ -407,8 +408,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
> >
> > /*
> > * Add the timer to the hash table. The timer is not yet valid
> > - * because new_timer::it_signal is still NULL. The timer id is also
> > - * not yet visible to user space.
>
> Even with this change the timer ID is not yet visible to user space
> because it has not yet been copied back ....
>
It is seen in concurrent lookups.
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add()
2025-02-20 8:48 ` Eric Dumazet
@ 2025-02-20 14:10 ` Thomas Gleixner
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 14:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20 2025 at 09:48, Eric Dumazet wrote:
> On Thu, Feb 20, 2025 at 9:12 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
>> > A timer is visible only when both timer->signal and timer->i_id
>> > are set to their final values.
>> >
>> > We can initialize timer->it_id sooner.
>>
>> This changelog tells nothing. What is this fixing, why is it required to
>> initialize this sooner?
>
> Just to make the series more readable and bisectable.
That's fine. But this changelog does not explain what the patch is about.
>> We can... We also can not ... Aside of that please write changelogs in
>> imperative mood as Documentation asks for.
>
> I do not understand.
What's so hard to understand?
What does "We can initialize timer->it_id sooner" tell me?
Yes, we can initialize it sooner, but what's the point? What's the
problem this is solving?
Also if you can't find the relevant documentation about imperative mood,
let me cite it to you:
"Describe your changes in imperative mood, e.g. “make xyzzy do frotz”
instead of “[This patch] makes xyzzy do frotz” or “[I] changed xyzzy to
do frotz”, as if you are giving orders to the codebase to change its
behaviour."
> I wrote more than 5000 upstream linux patches, you are trying to
> educate me in 2025 ?
No. I'm asking you to provide proper change logs which match what's
requested in documentation. I'm asking that from anyone independent of
patch count.
>> > /*
>> > * Add the timer to the hash table. The timer is not yet valid
>> > - * because new_timer::it_signal is still NULL. The timer id is also
>> > - * not yet visible to user space.
>>
>> Even with this change the timer ID is not yet visible to user space
>> because it has not yet been copied back ....
>>
> It is seen in concurrent lookups.
But still user space cannot observe it, right?
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH V2 3/4] posix-timers: Initialise timer->it_signal in posix_timer_add()
2025-02-19 12:55 [PATCH V2 0/4] posix-timers: Reduce spinlock contention Eric Dumazet
2025-02-19 12:55 ` [PATCH V2 1/4] posix-timers: Make next_posix_timer_id an atomic_t Eric Dumazet
2025-02-19 12:55 ` [PATCH V2 2/4] posix-timers: Initialise timer->it_id in posix_timer_add() Eric Dumazet
@ 2025-02-19 12:55 ` Eric Dumazet
2025-02-20 8:19 ` Thomas Gleixner
2025-02-19 12:55 ` [PATCH V2 4/4] posix-timers: Use RCU " Eric Dumazet
3 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-19 12:55 UTC (permalink / raw)
To: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
Instead of leaving a NULL value in timer->it_signal,
set it to the current sig pointer, but with the low order bit set.
This fixes a potential race, in the unlikely case a thread
was preempted long enough that other threads created more than
2^31 itimers.
Rename __posix_timers_find() to posix_timers_find()
Mask the low order bit in posix_timers_find().
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
kernel/time/posix-timers.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 1f73ea955756..ed27c7eab456 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -72,15 +72,22 @@ static int hash(struct signal_struct *sig, unsigned int nr)
return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
}
-static struct k_itimer *__posix_timers_find(struct hlist_head *head,
+static struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
+{
+ /* timer->it_signal can be set concurrently */
+ unsigned long val = (unsigned long)READ_ONCE(timer->it_signal);
+
+ return (struct signal_struct *)(val & ~1UL);
+}
+
+static struct k_itimer *posix_timers_find(struct hlist_head *head,
struct signal_struct *sig,
timer_t id)
{
struct k_itimer *timer;
hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
- /* timer->it_signal can be set concurrently */
- if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
+ if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
return timer;
}
return NULL;
@@ -90,8 +97,14 @@ static struct k_itimer *posix_timer_by_id(timer_t id)
{
struct signal_struct *sig = current->signal;
struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
+ struct k_itimer *timer;
- return __posix_timers_find(head, sig, id);
+ hlist_for_each_entry_rcu(timer, head, t_hash) {
+ /* timer->it_signal can be set concurrently */
+ if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
+ return timer;
+ }
+ return NULL;
}
static int posix_timer_add(struct k_itimer *timer)
@@ -113,8 +126,9 @@ static int posix_timer_add(struct k_itimer *timer)
head = &posix_timers_hashtable[hash(sig, id)];
spin_lock(&hash_lock);
- if (!__posix_timers_find(head, sig, id)) {
+ if (!posix_timers_find(head, sig, id)) {
timer->it_id = (timer_t)id;
+ timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
hlist_add_head_rcu(&timer->t_hash, head);
spin_unlock(&hash_lock);
return id;
@@ -453,7 +467,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
}
/*
* After succesful copy out, the timer ID is visible to user space
- * now but not yet valid because new_timer::signal is still NULL.
+ * now but not yet valid because new_timer::signal low order bit is 1.
*
* Complete the initialization with the clock specific create
* callback.
@@ -463,7 +477,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
goto out;
spin_lock_irq(¤t->sighand->siglock);
- /* This makes the timer valid in the hash table */
+ /* This makes the timer valid in the hash table, clearing low order bit. */
WRITE_ONCE(new_timer->it_signal, current->signal);
hlist_add_head(&new_timer->list, ¤t->signal->posix_timers);
spin_unlock_irq(¤t->sighand->siglock);
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH V2 3/4] posix-timers: Initialise timer->it_signal in posix_timer_add()
2025-02-19 12:55 ` [PATCH V2 3/4] posix-timers: Initialise timer->it_signal " Eric Dumazet
@ 2025-02-20 8:19 ` Thomas Gleixner
2025-02-20 8:44 ` Eric Dumazet
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 8:19 UTC (permalink / raw)
To: Eric Dumazet, Anna-Maria Behnsen, Frederic Weisbecker
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> Instead of leaving a NULL value in timer->it_signal,
> set it to the current sig pointer, but with the low order bit set.
And that low order bit set does what?
> This fixes a potential race, in the unlikely case a thread
> was preempted long enough that other threads created more than
> 2^31 itimers.
and then what happens?
> Rename __posix_timers_find() to posix_timers_find()
That's not what the patch does. It renames to posix_sig_owner(). Aside
of that the rename is not relevant to the problem itself.
> Mask the low order bit in posix_timers_find().
What for?
I pointed you before to the changelog documentation, which clearly says:
A good structure is to explain the context, the problem and the
solution in separate paragraphs and this order.
It's not asked too much to write proper change logs.
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> kernel/time/posix-timers.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index 1f73ea955756..ed27c7eab456 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -72,15 +72,22 @@ static int hash(struct signal_struct *sig, unsigned int nr)
> return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
> }
>
> -static struct k_itimer *__posix_timers_find(struct hlist_head *head,
> +static struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
> +{
> + /* timer->it_signal can be set concurrently */
> + unsigned long val = (unsigned long)READ_ONCE(timer->it_signal);
> +
> + return (struct signal_struct *)(val & ~1UL);
> +}
> +
> +static struct k_itimer *posix_timers_find(struct hlist_head *head,
> struct signal_struct *sig,
> timer_t id)
> {
> struct k_itimer *timer;
>
> hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
> - /* timer->it_signal can be set concurrently */
> - if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
> + if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
> return timer;
> }
> return NULL;
> @@ -90,8 +97,14 @@ static struct k_itimer *posix_timer_by_id(timer_t id)
> {
> struct signal_struct *sig = current->signal;
> struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
> + struct k_itimer *timer;
>
> - return __posix_timers_find(head, sig, id);
> + hlist_for_each_entry_rcu(timer, head, t_hash) {
> + /* timer->it_signal can be set concurrently */
> + if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
> + return timer;
> + }
> + return NULL;
> }
>
> static int posix_timer_add(struct k_itimer *timer)
> @@ -113,8 +126,9 @@ static int posix_timer_add(struct k_itimer *timer)
> head = &posix_timers_hashtable[hash(sig, id)];
>
> spin_lock(&hash_lock);
> - if (!__posix_timers_find(head, sig, id)) {
> + if (!posix_timers_find(head, sig, id)) {
> timer->it_id = (timer_t)id;
> + timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
> hlist_add_head_rcu(&timer->t_hash, head);
> spin_unlock(&hash_lock);
> return id;
> @@ -453,7 +467,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
> }
> /*
> * After succesful copy out, the timer ID is visible to user space
> - * now but not yet valid because new_timer::signal is still NULL.
> + * now but not yet valid because new_timer::signal low order bit is 1.
> *
> * Complete the initialization with the clock specific create
> * callback.
> @@ -463,7 +477,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
> goto out;
>
> spin_lock_irq(¤t->sighand->siglock);
> - /* This makes the timer valid in the hash table */
> + /* This makes the timer valid in the hash table, clearing low order bit. */
Clearing the low order bit of what? This is a full write and not a clear
low order bit operation.
> WRITE_ONCE(new_timer->it_signal, current->signal);
> hlist_add_head(&new_timer->list, ¤t->signal->posix_timers);
> spin_unlock_irq(¤t->sighand->siglock);
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 3/4] posix-timers: Initialise timer->it_signal in posix_timer_add()
2025-02-20 8:19 ` Thomas Gleixner
@ 2025-02-20 8:44 ` Eric Dumazet
2025-02-20 14:13 ` Thomas Gleixner
0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-02-20 8:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20, 2025 at 9:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> > Instead of leaving a NULL value in timer->it_signal,
> > set it to the current sig pointer, but with the low order bit set.
>
> And that low order bit set does what?
>
> > This fixes a potential race, in the unlikely case a thread
> > was preempted long enough that other threads created more than
> > 2^31 itimers.
>
> and then what happens?
Two threads might get the same timer_id given back.
>
> > Rename __posix_timers_find() to posix_timers_find()
>
> That's not what the patch does. It renames to posix_sig_owner(). Aside
> of that the rename is not relevant to the problem itself.
posix_sig_owner() is a new helper, to remove the low order bit.
>
> > Mask the low order bit in posix_timers_find().
>
> What for?
>
> I pointed you before to the changelog documentation, which clearly says:
>
> A good structure is to explain the context, the problem and the
> solution in separate paragraphs and this order.
>
> It's not asked too much to write proper change logs.
Ok.
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > kernel/time/posix-timers.c | 28 +++++++++++++++++++++-------
> > 1 file changed, 21 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> > index 1f73ea955756..ed27c7eab456 100644
> > --- a/kernel/time/posix-timers.c
> > +++ b/kernel/time/posix-timers.c
> > @@ -72,15 +72,22 @@ static int hash(struct signal_struct *sig, unsigned int nr)
> > return hash_32(hash32_ptr(sig) ^ nr, HASH_BITS(posix_timers_hashtable));
> > }
> >
> > -static struct k_itimer *__posix_timers_find(struct hlist_head *head,
> > +static struct signal_struct *posix_sig_owner(const struct k_itimer *timer)
> > +{
> > + /* timer->it_signal can be set concurrently */
> > + unsigned long val = (unsigned long)READ_ONCE(timer->it_signal);
> > +
> > + return (struct signal_struct *)(val & ~1UL);
> > +}
> > +
> > +static struct k_itimer *posix_timers_find(struct hlist_head *head,
> > struct signal_struct *sig,
> > timer_t id)
> > {
> > struct k_itimer *timer;
> >
> > hlist_for_each_entry_rcu(timer, head, t_hash, lockdep_is_held(&hash_lock)) {
> > - /* timer->it_signal can be set concurrently */
> > - if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
> > + if ((posix_sig_owner(timer) == sig) && (timer->it_id == id))
> > return timer;
> > }
> > return NULL;
> > @@ -90,8 +97,14 @@ static struct k_itimer *posix_timer_by_id(timer_t id)
> > {
> > struct signal_struct *sig = current->signal;
> > struct hlist_head *head = &posix_timers_hashtable[hash(sig, id)];
> > + struct k_itimer *timer;
> >
> > - return __posix_timers_find(head, sig, id);
> > + hlist_for_each_entry_rcu(timer, head, t_hash) {
> > + /* timer->it_signal can be set concurrently */
> > + if ((READ_ONCE(timer->it_signal) == sig) && (timer->it_id == id))
> > + return timer;
> > + }
> > + return NULL;
> > }
> >
> > static int posix_timer_add(struct k_itimer *timer)
> > @@ -113,8 +126,9 @@ static int posix_timer_add(struct k_itimer *timer)
> > head = &posix_timers_hashtable[hash(sig, id)];
> >
> > spin_lock(&hash_lock);
> > - if (!__posix_timers_find(head, sig, id)) {
> > + if (!posix_timers_find(head, sig, id)) {
> > timer->it_id = (timer_t)id;
> > + timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
> > hlist_add_head_rcu(&timer->t_hash, head);
> > spin_unlock(&hash_lock);
> > return id;
> > @@ -453,7 +467,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
> > }
> > /*
> > * After succesful copy out, the timer ID is visible to user space
> > - * now but not yet valid because new_timer::signal is still NULL.
> > + * now but not yet valid because new_timer::signal low order bit is 1.
> > *
> > * Complete the initialization with the clock specific create
> > * callback.
> > @@ -463,7 +477,7 @@ static int do_timer_create(clockid_t which_clock, struct sigevent *event,
> > goto out;
> >
> > spin_lock_irq(¤t->sighand->siglock);
> > - /* This makes the timer valid in the hash table */
> > + /* This makes the timer valid in the hash table, clearing low order bit. */
>
> Clearing the low order bit of what? This is a full write and not a clear
> low order bit operation.
>
Prior value was (sig | 1L)
New value is (sig)
-> low order bit is cleared.
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 3/4] posix-timers: Initialise timer->it_signal in posix_timer_add()
2025-02-20 8:44 ` Eric Dumazet
@ 2025-02-20 14:13 ` Thomas Gleixner
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-20 14:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet
On Thu, Feb 20 2025 at 09:44, Eric Dumazet wrote:
> On Thu, Feb 20, 2025 at 9:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > This fixes a potential race, in the unlikely case a thread
>> > was preempted long enough that other threads created more than
>> > 2^31 itimers.
>>
>> and then what happens?
>
> Two threads might get the same timer_id given back.
I know that, but how will someone who reads that changelog without the
knowledge and background information know?
That's the whole point of change logs to explain it for the uninformed
reader, no?
>> >
>> > spin_lock_irq(¤t->sighand->siglock);
>> > - /* This makes the timer valid in the hash table */
>> > + /* This makes the timer valid in the hash table, clearing low order bit. */
>>
>> Clearing the low order bit of what? This is a full write and not a clear
>> low order bit operation.
>>
>
> Prior value was (sig | 1L)
>
> New value is (sig)
>
> -> low order bit is cleared.
Right I know, but again it's not obvious without figuring out from some
other place what the logic behind this is.
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
2025-02-19 12:55 [PATCH V2 0/4] posix-timers: Reduce spinlock contention Eric Dumazet
` (2 preceding siblings ...)
2025-02-19 12:55 ` [PATCH V2 3/4] posix-timers: Initialise timer->it_signal " Eric Dumazet
@ 2025-02-19 12:55 ` Eric Dumazet
2025-02-19 19:38 ` David Laight
2025-02-24 9:33 ` Thomas Gleixner
3 siblings, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-02-19 12:55 UTC (permalink / raw)
To: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet
If many posix timers are hashed in posix_timers_hashtable,
hash_lock can be held for long durations.
This can be really bad in some cases as Thomas
explained in https://lore.kernel.org/all/87ednpyyeo.ffs@tglx/
We can perform all searches under RCU, then acquire
the lock only when there is a good chance to need it,
and after cpu caches were populated.
Also add a cond_resched() in the possible long loop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
kernel/time/posix-timers.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index ed27c7eab456..bd73bc4707c1 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
head = &posix_timers_hashtable[hash(sig, id)];
+ rcu_read_lock();
+ if (posix_timers_find(head, sig, id)) {
+ rcu_read_unlock();
+ cond_resched();
+ continue;
+ }
+ rcu_read_unlock();
spin_lock(&hash_lock);
+ /*
+ * We must perform the lookup under hash_lock protection
+ * because another thread could have used the same id.
+ * This is very unlikely, but possible.
+ */
if (!posix_timers_find(head, sig, id)) {
timer->it_id = (timer_t)id;
timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
--
2.48.1.601.g30ceb7b040-goog
^ permalink raw reply related [flat|nested] 21+ messages in thread* Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
2025-02-19 12:55 ` [PATCH V2 4/4] posix-timers: Use RCU " Eric Dumazet
@ 2025-02-19 19:38 ` David Laight
2025-02-19 19:46 ` Eric Dumazet
2025-02-24 9:33 ` Thomas Gleixner
1 sibling, 1 reply; 21+ messages in thread
From: David Laight @ 2025-02-19 19:38 UTC (permalink / raw)
To: Eric Dumazet
Cc: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
linux-kernel, Benjamin Segall, Eric Dumazet
On Wed, 19 Feb 2025 12:55:22 +0000
Eric Dumazet <edumazet@google.com> wrote:
> If many posix timers are hashed in posix_timers_hashtable,
> hash_lock can be held for long durations.
>
> This can be really bad in some cases as Thomas
> explained in https://lore.kernel.org/all/87ednpyyeo.ffs@tglx/
>
> We can perform all searches under RCU, then acquire
> the lock only when there is a good chance to need it,
> and after cpu caches were populated.
>
> Also add a cond_resched() in the possible long loop.
Since this code fragment has a 'free choice' of the timer id, why not
select an empty table slot and then pick a value that maps to it?
You can run a free-list through the empty table slots so the allocate
is (almost always) fixed cost and trivial.
The only complexity arises when the table is full and needs to be
reallocated twice as large.
The high bits of the 'id' can be incremented every time the id is allocated
so stale ids can be detected (until a quite large number of allocate/free).
David
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> kernel/time/posix-timers.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
> index ed27c7eab456..bd73bc4707c1 100644
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
>
> head = &posix_timers_hashtable[hash(sig, id)];
>
> + rcu_read_lock();
> + if (posix_timers_find(head, sig, id)) {
> + rcu_read_unlock();
> + cond_resched();
> + continue;
> + }
> + rcu_read_unlock();
> spin_lock(&hash_lock);
> + /*
> + * We must perform the lookup under hash_lock protection
> + * because another thread could have used the same id.
> + * This is very unlikely, but possible.
> + */
> if (!posix_timers_find(head, sig, id)) {
> timer->it_id = (timer_t)id;
> timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
2025-02-19 19:38 ` David Laight
@ 2025-02-19 19:46 ` Eric Dumazet
0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-02-19 19:46 UTC (permalink / raw)
To: David Laight
Cc: Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
linux-kernel, Benjamin Segall, Eric Dumazet
On Wed, Feb 19, 2025 at 8:38 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 19 Feb 2025 12:55:22 +0000
> Eric Dumazet <edumazet@google.com> wrote:
>
> > If many posix timers are hashed in posix_timers_hashtable,
> > hash_lock can be held for long durations.
> >
> > This can be really bad in some cases as Thomas
> > explained in https://lore.kernel.org/all/87ednpyyeo.ffs@tglx/
> >
> > We can perform all searches under RCU, then acquire
> > the lock only when there is a good chance to need it,
> > and after cpu caches were populated.
> >
> > Also add a cond_resched() in the possible long loop.
>
> Since this code fragment has a 'free choice' of the timer id, why not
> select an empty table slot and then pick a value that maps to it?
>
> You can run a free-list through the empty table slots so the allocate
> is (almost always) fixed cost and trivial.
> The only complexity arises when the table is full and needs to be
> reallocated twice as large.
>
> The high bits of the 'id' can be incremented every time the id is allocated
> so stale ids can be detected (until a quite large number of allocate/free).
We have to understand the RCU lookup can only fail after 2^31 timer
allocations for a given processus,
I am not sure why we would bother.
CRIU wants predictable timer_id sequences.
We have to try N, N+1, N+2 ... until we find a suitable id.
Alternative would be to add a new system call I guess.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
2025-02-19 12:55 ` [PATCH V2 4/4] posix-timers: Use RCU " Eric Dumazet
2025-02-19 19:38 ` David Laight
@ 2025-02-24 9:33 ` Thomas Gleixner
2025-02-24 9:58 ` Eric Dumazet
1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2025-02-24 9:33 UTC (permalink / raw)
To: Eric Dumazet, Anna-Maria Behnsen, Frederic Weisbecker
Cc: linux-kernel, Benjamin Segall, Eric Dumazet, Eric Dumazet,
Peter Zijlstra
On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
>
> head = &posix_timers_hashtable[hash(sig, id)];
>
> + rcu_read_lock();
> + if (posix_timers_find(head, sig, id)) {
> + rcu_read_unlock();
> + cond_resched();
> + continue;
> + }
> + rcu_read_unlock();
> spin_lock(&hash_lock);
> + /*
> + * We must perform the lookup under hash_lock protection
> + * because another thread could have used the same id.
> + * This is very unlikely, but possible.
> + */
> if (!posix_timers_find(head, sig, id)) {
> timer->it_id = (timer_t)id;
> timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
So I played with that because I wanted understand what's going on.
Actually this change is just voodoo programming. Why?
The only case where this lockless lookup would help is when the timer
ID wrapped around and the lookup has to validate a large range of
already occupied IDs, but even then the lookup is dropping hash_lock
after each ID search. I seriously doubt that this is the case at hand
because according to Ben this happens when a process is started or
restored, which means there is no timer ID wrap around and therefore
no timer ID collision in the process itself.
In fact the extra lookup is counter productive in most cases:
With a single process, which creates 50000 timers in a loop, this
results in:
[1] mainline +lookup
create 98 ms 219 ms
and it gets worse with 100000 timers:
[2] mainline +lookup
create 313 ms 650 ms
Why? Because with 100k timers the hash buckets contain ~200 timers each,
which means in the worst case 200 timers have to be walked and
checked. Doing this twice obviously at least doubles the time.
Now there is a case where the change improves the situation, but for the
very wrong reasons:
A process forks 63 times and all forks wait on a barrier before each
instance creates 20000 timers. The processes are pinned on seperate CPUs
to achive maximum concurrency. The numbers are the average times per
process:
[3] mainline +lookup
create 157253 ms 40008 ms
But that improvement has absolutely nothing to do with the timer ID
collision. The extra lookup (and I verified with tracing) creates just
enough interleaving so that all CPUs can make progress on the hash lock
instead of being stuck in a cache line bouncing shitshow. The very same
can be achieved by:
ndelay(total_number_of_timers / $CONSTANT);
So, no. This is not a solution. The proper solution is to replace the
hash by a scaled hash with per hash bucket locks, like we have in the
futex code already. I've implemented this and the result proves the
point with the three test cases:
[1] mainline +lookup scaled hash
create 98 ms 219 ms 20 ms
[2] mainline +lookup scaled hash
create 313 ms 650 ms 48 ms
[3] mainline +lookup scaled hash
create 157253 ms 40008 ms 83 ms
This is really only a problem of the hash collisions and the resulting
list walk length, which is easy to prove by extending the tests above.
After creating the timer and synchronizing again (for the fork case),
the test invokes timer_getoverrun(2) for all created timers in a loop.
[1] mainline +lookup scaled hash
create 98 ms 219 ms 20 ms
getoverrun 62 ms 62 ms 10 ms
[2] mainline +lookup scaled hash
create 313 ms 650 ms 48 ms
getoverrun 261 ms 260 ms 20 ms
[3] mainline +lookup scaled hash
create 157253 ms 40008 ms 83 ms
getoverrun 2611 ms 2614 ms 40 ms
I picked timer_getoverrun(2) because that's just doing the lookup and
reading from the k_itimer struct after locking it. So the syscall has no
contention on the timer lock nor on anything else. According to perf the
vast majority of time is spent in the list walk to find the timer and of
course the cache foot print becomes worse the larger the bucket lists
become.
Let me write proper changelogs and a cover letter and post that stuff.
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [PATCH V2 4/4] posix-timers: Use RCU in posix_timer_add()
2025-02-24 9:33 ` Thomas Gleixner
@ 2025-02-24 9:58 ` Eric Dumazet
0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-02-24 9:58 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Anna-Maria Behnsen, Frederic Weisbecker, linux-kernel,
Benjamin Segall, Eric Dumazet, Peter Zijlstra
On Mon, Feb 24, 2025 at 10:33 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, Feb 19 2025 at 12:55, Eric Dumazet wrote:
> > --- a/kernel/time/posix-timers.c
> > +++ b/kernel/time/posix-timers.c
> > @@ -125,7 +125,19 @@ static int posix_timer_add(struct k_itimer *timer)
> >
> > head = &posix_timers_hashtable[hash(sig, id)];
> >
> > + rcu_read_lock();
> > + if (posix_timers_find(head, sig, id)) {
> > + rcu_read_unlock();
> > + cond_resched();
> > + continue;
> > + }
> > + rcu_read_unlock();
> > spin_lock(&hash_lock);
> > + /*
> > + * We must perform the lookup under hash_lock protection
> > + * because another thread could have used the same id.
> > + * This is very unlikely, but possible.
> > + */
> > if (!posix_timers_find(head, sig, id)) {
> > timer->it_id = (timer_t)id;
> > timer->it_signal = (struct signal_struct *)((unsigned long)sig | 1UL);
>
> So I played with that because I wanted understand what's going on.
>
> Actually this change is just voodoo programming. Why?
>
> The only case where this lockless lookup would help is when the timer
> ID wrapped around and the lookup has to validate a large range of
> already occupied IDs, but even then the lookup is dropping hash_lock
> after each ID search. I seriously doubt that this is the case at hand
> because according to Ben this happens when a process is started or
> restored, which means there is no timer ID wrap around and therefore
> no timer ID collision in the process itself.
>
> In fact the extra lookup is counter productive in most cases:
>
> With a single process, which creates 50000 timers in a loop, this
> results in:
>
> [1] mainline +lookup
> create 98 ms 219 ms
>
> and it gets worse with 100000 timers:
>
> [2] mainline +lookup
> create 313 ms 650 ms
>
> Why? Because with 100k timers the hash buckets contain ~200 timers each,
> which means in the worst case 200 timers have to be walked and
> checked. Doing this twice obviously at least doubles the time.
>
> Now there is a case where the change improves the situation, but for the
> very wrong reasons:
>
> A process forks 63 times and all forks wait on a barrier before each
> instance creates 20000 timers. The processes are pinned on seperate CPUs
> to achive maximum concurrency. The numbers are the average times per
> process:
>
> [3] mainline +lookup
> create 157253 ms 40008 ms
>
> But that improvement has absolutely nothing to do with the timer ID
> collision. The extra lookup (and I verified with tracing) creates just
> enough interleaving so that all CPUs can make progress on the hash lock
> instead of being stuck in a cache line bouncing shitshow. The very same
> can be achieved by:
>
> ndelay(total_number_of_timers / $CONSTANT);
>
> So, no. This is not a solution. The proper solution is to replace the
> hash by a scaled hash with per hash bucket locks, like we have in the
> futex code already. I've implemented this and the result proves the
> point with the three test cases:
>
> [1] mainline +lookup scaled hash
> create 98 ms 219 ms 20 ms
>
> [2] mainline +lookup scaled hash
> create 313 ms 650 ms 48 ms
>
> [3] mainline +lookup scaled hash
> create 157253 ms 40008 ms 83 ms
>
> This is really only a problem of the hash collisions and the resulting
> list walk length, which is easy to prove by extending the tests above.
>
> After creating the timer and synchronizing again (for the fork case),
> the test invokes timer_getoverrun(2) for all created timers in a loop.
>
> [1] mainline +lookup scaled hash
> create 98 ms 219 ms 20 ms
> getoverrun 62 ms 62 ms 10 ms
>
> [2] mainline +lookup scaled hash
> create 313 ms 650 ms 48 ms
> getoverrun 261 ms 260 ms 20 ms
>
> [3] mainline +lookup scaled hash
> create 157253 ms 40008 ms 83 ms
> getoverrun 2611 ms 2614 ms 40 ms
>
> I picked timer_getoverrun(2) because that's just doing the lookup and
> reading from the k_itimer struct after locking it. So the syscall has no
> contention on the timer lock nor on anything else. According to perf the
> vast majority of time is spent in the list walk to find the timer and of
> course the cache foot print becomes worse the larger the bucket lists
> become.
>
> Let me write proper changelogs and a cover letter and post that stuff.
>
Great, we are looking forward to your patches.
^ permalink raw reply [flat|nested] 21+ messages in thread