* Futex hash_bucket lock can break isolation and cause priority inversion on RT
@ 2024-10-08 15:22 Juri Lelli
2024-10-08 15:38 ` André Almeida
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Juri Lelli @ 2024-10-08 15:22 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar
Cc: Peter Zijlstra, Darren Hart, Davidlohr Bueso, André Almeida,
LKML, linux-rt-users, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
Hello,
A report concerning latency sensitive applications using futexes on a
PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
how futexes are implemented. The following is an attempt to make sense
of what I am seeing from traces, validate that it indeed might make
sense and possibly collect ideas on how to address the issue at hand.
Simplifying what is actually a quite complicated setup composed of
non-realtime (i.e., background load mostly related to a containers
orchestrator) and realtime tasks, we can consider the following
situation:
- Multiprocessor system running a PREEMPT_RT kernel
- Housekeeping CPUs (usually 2) running background tasks + “isolated”
CPUs running latency sensitive tasks (which may also need to run
non-realtime activities at times)
- CPUs are isolated dynamically by using nohz_full/rcu_nocbs options
and affinity, no static scheduler isolation is used (i.e., no
isolcpus=domain)
- Threaded IRQs, RCU related kthreads, timers, etc. are configured with
the highest priorities on the system (FIFO)
- Latency sensitive application threads run at FIFO priority below the
set of tasks from the former point
- Latency sensitive application uses futexes, but they protect data
only shared among tasks running on the isolated set of CPUs
- Tasks running on housekeeping CPUs also use futexes
- Futexes belonging to the above two sets of non interacting tasks are
distinct
Under these conditions the actual issue presents itself when:
- A background task on a housekeeping CPU enters the sys_futex syscall and
locks a hb->lock (PI enabled mutex on RT)
- That background task gets preempted by a higher priority task (e.g.
NIC irq thread)
- A low latency application task on an isolated CPU also enters
sys_futex, hash-collides into the background task's hb, tries to
grab hb->lock and, even if it boosts the background task, it still
needs to wait for the higher priority task (NIC irq) to finish
executing on the housekeeping CPU and eventually misses its deadline
Now, of course by making the latency sensitive application tasks use a
higher priority than anything on housekeeping CPUs we could avoid the
issue, but the fact that an implicit in-kernel link between otherwise
unrelated tasks might cause priority inversion is probably not ideal?
Thus this email.
Does this report make any sense? If it does, has this issue ever been
reported and possibly discussed? I guess it’s kind of a corner case, but
I wonder if anybody has suggestions already on how to possibly try to
tackle it from a kernel perspective.
Thanks!
Juri
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:22 Futex hash_bucket lock can break isolation and cause priority inversion on RT Juri Lelli
@ 2024-10-08 15:38 ` André Almeida
2024-10-08 15:51 ` Sebastian Andrzej Siewior
2024-10-08 17:38 ` Peter Zijlstra
2024-10-08 18:30 ` Waiman Long
2 siblings, 1 reply; 13+ messages in thread
From: André Almeida @ 2024-10-08 15:38 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, Ingo Molnar, Darren Hart, Davidlohr Bueso, LKML,
Thomas Gleixner, linux-rt-users, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
Hi Juri,
On 08/10/2024 12:22, Juri Lelli wrote:
[...]
> Now, of course by making the latency sensitive application tasks use a
> higher priority than anything on housekeeping CPUs we could avoid the
> issue, but the fact that an implicit in-kernel link between otherwise
> unrelated tasks might cause priority inversion is probably not ideal?
> Thus this email.
>
> Does this report make any sense? If it does, has this issue ever been
> reported and possibly discussed? I guess it’s kind of a corner case, but
> I wonder if anybody has suggestions already on how to possibly try to
> tackle it from a kernel perspective.
>
That's right, unrelated apps can share the same futex bucket, causing
those side effects. The bucket is determined by futex_hash() and then
tasks get the hash bucket lock at futex_q_lock(), and none of those
functions have awareness of priorities.
There's this work from Thomas that aims to solve corner cases like this
by giving apps the option, instead of using the global hash table, to
have their own allocated wait queue:
https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
"Collisions on that hash can lead to performance degradation
and on real-time enabled kernels to unbound priority inversions."
> Thanks!
> Juri
>
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:38 ` André Almeida
@ 2024-10-08 15:51 ` Sebastian Andrzej Siewior
2024-10-08 15:59 ` André Almeida
0 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-08 15:51 UTC (permalink / raw)
To: André Almeida
Cc: Juri Lelli, Peter Zijlstra, Ingo Molnar, Darren Hart,
Davidlohr Bueso, LKML, Thomas Gleixner, linux-rt-users,
Valentin Schneider, Waiman Long
On 2024-10-08 12:38:11 [-0300], André Almeida wrote:
> On 08/10/2024 12:22, Juri Lelli wrote:
>
> [...]
>
> > Now, of course by making the latency sensitive application tasks use a
> > higher priority than anything on housekeeping CPUs we could avoid the
> > issue, but the fact that an implicit in-kernel link between otherwise
> > unrelated tasks might cause priority inversion is probably not ideal?
> > Thus this email.
> >
> > Does this report make any sense? If it does, has this issue ever been
> > reported and possibly discussed? I guess it’s kind of a corner case, but
> > I wonder if anybody has suggestions already on how to possibly try to
> > tackle it from a kernel perspective.
> >
>
> That's right, unrelated apps can share the same futex bucket, causing those
> side effects. The bucket is determined by futex_hash() and then tasks get
> the hash bucket lock at futex_q_lock(), and none of those functions have
> awareness of priorities.
Almost. Since Juri mentioned PREEMPT_RT, the hb locks are aware of
priorities. So in his case there was a PI boost: the task with the
higher priority can grab the hb lock before others do. However, since the
owner is blocked by the NIC thread, it can't make progress.
Lifting the priority over the NIC thread would bring the owner onto the
CPU so it can drop the hb lock.
> There's this work from Thomas that aims to solve corner cases like this by
> giving apps the option, instead of using the global hash table, to have
> their own allocated wait queue:
> https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
>
> "Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels to unbound priority inversions."
This is correct. The problem is also that the hb lock is hashed on
several things, so if you restart/reboot you may no longer share the hb
lock with the "bad" application.
Now that I think about it, of all things we never tried a per-process
(shared by threads) hb-lock, which could also be hashed. This would avoid
blocking on other applications; you would only have to blame your own threads.
> > Thanks!
> > Juri
Sebastian
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:51 ` Sebastian Andrzej Siewior
@ 2024-10-08 15:59 ` André Almeida
2024-10-08 18:09 ` Sebastian Andrzej Siewior
2024-10-09 8:36 ` Juri Lelli
0 siblings, 2 replies; 13+ messages in thread
From: André Almeida @ 2024-10-08 15:59 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Juri Lelli, Peter Zijlstra, Ingo Molnar, Darren Hart,
Davidlohr Bueso, LKML, Thomas Gleixner, linux-rt-users,
Valentin Schneider, Waiman Long
On 08/10/2024 12:51, Sebastian Andrzej Siewior wrote:
> On 2024-10-08 12:38:11 [-0300], André Almeida wrote:
>> On 08/10/2024 12:22, Juri Lelli wrote:
>>
>> [...]
>>
>>> Now, of course by making the latency sensitive application tasks use a
>>> higher priority than anything on housekeeping CPUs we could avoid the
>>> issue, but the fact that an implicit in-kernel link between otherwise
>>> unrelated tasks might cause priority inversion is probably not ideal?
>>> Thus this email.
>>>
>>> Does this report make any sense? If it does, has this issue ever been
>>> reported and possibly discussed? I guess it’s kind of a corner case, but
>>> I wonder if anybody has suggestions already on how to possibly try to
>>> tackle it from a kernel perspective.
>>>
>>
>> That's right, unrelated apps can share the same futex bucket, causing those
>> side effects. The bucket is determined by futex_hash() and then tasks get
>> the hash bucket lock at futex_q_lock(), and none of those functions have
>> awareness of priorities.
>
> Almost. Since Juri mentioned PREEMPT_RT, the hb locks are aware of
> priorities. So in his case there was a PI boost: the task with the
> higher priority can grab the hb lock before others do. However, since the
> owner is blocked by the NIC thread, it can't make progress.
> Lifting the priority over the NIC thread would bring the owner onto the
> CPU so it can drop the hb lock.
>
Oh that's right, thanks for pointing it out!
>> There's this work from Thomas that aims to solve corner cases like this by
>> giving apps the option, instead of using the global hash table, to have
>> their own allocated wait queue:
>> https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
>>
>> "Collisions on that hash can lead to performance degradation
>> and on real-time enabled kernels to unbound priority inversions."
>
> This is correct. The problem is also that the hb lock is hashed on
> several things, so if you restart/reboot you may no longer share the hb
> lock with the "bad" application.
>
> Now that I think about it, of all things we never tried a per-process
> (shared by threads) hb-lock, which could also be hashed. This would avoid
> blocking on other applications; you would only have to blame your own threads.
>
So if every process has its own hb-lock, every process has its own
bucket? It would act just like a linked list then?
>>> Thanks!
>>> Juri
>
> Sebastian
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:59 ` André Almeida
@ 2024-10-08 18:09 ` Sebastian Andrzej Siewior
2024-10-09 8:36 ` Juri Lelli
1 sibling, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-08 18:09 UTC (permalink / raw)
To: André Almeida
Cc: Juri Lelli, Peter Zijlstra, Ingo Molnar, Darren Hart,
Davidlohr Bueso, LKML, Thomas Gleixner, linux-rt-users,
Valentin Schneider, Waiman Long
On 2024-10-08 12:59:24 [-0300], André Almeida wrote:
>
> So if every process has its own hb-lock, every process has its own bucket?
> It would act just like a linked list then?
If you have one hb-lock, yes. But you could have 4 or 8 slots. A slot
is 64 bytes due to alignment, so 8 slots would occupy 512 bytes
of memory.
Sebastian
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:59 ` André Almeida
2024-10-08 18:09 ` Sebastian Andrzej Siewior
@ 2024-10-09 8:36 ` Juri Lelli
2024-10-24 22:36 ` Thomas Gleixner
1 sibling, 1 reply; 13+ messages in thread
From: Juri Lelli @ 2024-10-09 8:36 UTC (permalink / raw)
To: André Almeida
Cc: Sebastian Andrzej Siewior, Peter Zijlstra, Ingo Molnar,
Darren Hart, Davidlohr Bueso, LKML, Thomas Gleixner,
linux-rt-users, Valentin Schneider, Waiman Long
Hi André and Sebastian,
Thank you so much for your quick replies and for providing context that
I was missing!
On 08/10/24 12:59, André Almeida wrote:
> On 08/10/2024 12:51, Sebastian Andrzej Siewior wrote:
> > On 2024-10-08 12:38:11 [-0300], André Almeida wrote:
> > > On 08/10/2024 12:22, Juri Lelli wrote:
> > >
> > > [...]
> > >
> > > > Now, of course by making the latency sensitive application tasks use a
> > > > higher priority than anything on housekeeping CPUs we could avoid the
> > > > issue, but the fact that an implicit in-kernel link between otherwise
> > > > unrelated tasks might cause priority inversion is probably not ideal?
> > > > Thus this email.
> > > >
> > > > Does this report make any sense? If it does, has this issue ever been
> > > > reported and possibly discussed? I guess it’s kind of a corner case, but
> > > > I wonder if anybody has suggestions already on how to possibly try to
> > > > tackle it from a kernel perspective.
> > > >
> > >
> > > That's right, unrelated apps can share the same futex bucket, causing those
> > > side effects. The bucket is determined by futex_hash() and then tasks get
> > > the hash bucket lock at futex_q_lock(), and none of those functions have
> > > awareness of priorities.
> >
> > Almost. Since Juri mentioned PREEMPT_RT, the hb locks are aware of
> > priorities. So in his case there was a PI boost: the task with the
> > higher priority can grab the hb lock before others do. However, since the
> > owner is blocked by the NIC thread, it can't make progress.
> > Lifting the priority over the NIC thread would bring the owner onto the
> > CPU so it can drop the hb lock.
> >
>
> Oh that's right, thanks for pointing it out!
>
> > > There's this work from Thomas that aims to solve corner cases like this by
> > > giving apps the option, instead of using the global hash table, to have
> > > their own allocated wait queue:
> > > https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
> > >
> > > "Collisions on that hash can lead to performance degradation
> > > and on real-time enabled kernels to unbound priority inversions."
> >
> > This is correct. The problem is also that the hb lock is hashed on
> > several things, so if you restart/reboot you may no longer share the hb
> > lock with the "bad" application.
> >
> > Now that I think about it, of all things we never tried a per-process
> > (shared by threads) hb-lock, which could also be hashed. This would avoid
> > blocking on other applications; you would only have to blame your own threads.
> >
Would this be somewhat similar to what Linus (and Ingo, IIUC) were
inclined to suggest in the thread above (edited)?
---
So automatically using a local hashtable according to some heuristic is
definitely the way to go. And yes, the heuristic may well be - at
least to start - "this is a preempt-RT system" (for people who clearly
care about having predictable latencies) or "this is actually a
multi-node NUMA system, and I have heaps of memory"
---
So, make it per-process local by default on PREEMPT_RT and NUMA?
Thanks,
Juri
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-09 8:36 ` Juri Lelli
@ 2024-10-24 22:36 ` Thomas Gleixner
0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2024-10-24 22:36 UTC (permalink / raw)
To: Juri Lelli, André Almeida
Cc: Sebastian Andrzej Siewior, Peter Zijlstra, Ingo Molnar,
Darren Hart, Davidlohr Bueso, LKML, linux-rt-users,
Valentin Schneider, Waiman Long
On Wed, Oct 09 2024 at 09:36, Juri Lelli wrote:
> On 08/10/24 12:59, André Almeida wrote:
>> > > There's this work from Thomas that aims to solve corner cases like this by
>> > > giving apps the option, instead of using the global hash table, to have
>> > > their own allocated wait queue:
>> > > https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
>> > >
>> > > "Collisions on that hash can lead to performance degradation
>> > > and on real-time enabled kernels to unbound priority inversions."
>> >
>> > This is correct. The problem is also that the hb lock is hashed on
>> > several things, so if you restart/reboot you may no longer share the hb
>> > lock with the "bad" application.
>> >
>> > Now that I think about it, of all things we never tried a per-process
>> > (shared by threads) hb-lock, which could also be hashed. This would avoid
>> > blocking on other applications; you would only have to blame your own threads.
>
> Would this be somewhat similar to what Linus (and Ingo, IIUC) were
> inclined to suggest in the thread above (edited)?
>
> ---
> So automatically using a local hashtable according to some heuristic is
> definitely the way to go. And yes, the heuristic may well be - at
> least to start - "this is a preempt-RT system" (for people who clearly
> care about having predictable latencies) or "this is actually a
> multi-node NUMA system, and I have heaps of memory"
> ---
>
> So, make it per-process local by default on PREEMPT_RT and NUMA?
I somehow did not have cycles to follow up on that proposal back then
and consequently forgot about it :(
To make this sane, per-process has to be restricted to process-private
futexes. That's a reasonable restriction IMO and it completely avoids the
global state dance which we implemented back then.
I just dug up my old notes. Let me dump some thoughts.
1) The reason for the attachment syscall was to avoid latency on first
usage, which can be far into the application lifetime because the
kernel only learns about the futex when there is contention.
For most scenarios this should be a non-issue because allocating a
small hash table is usually not a problem, especially if you use a
dedicated kmem_cache for it. Under memory pressure, that's a
different issue, but a RT system should not get there in the first
place.
But for RT systems this might matter. Though we can be clever about
it and allow preallocation of the per process hash table via a TBD
sys_futex_init_private_hash() syscall or a prctl().
2) We aimed for zero collisions back then by making this an index-based
mechanism. Though there was an open question of how to limit the
maximum table size: from my notes, some heavily threaded enterprise
Java muck used a gazillion futexes and required an insane number of
entries...
We need some sane default/maximum sizing of the per-process hash
table which can be adjusted by the sysadmin.
Whether the proper mechanism is a syscall audit, which includes
prctl(), or a UID/GID based rlimit does not matter much. That's a
question for system admins/configurators to answer.
Hope that helps.
Thanks,
tglx
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:22 Futex hash_bucket lock can break isolation and cause priority inversion on RT Juri Lelli
2024-10-08 15:38 ` André Almeida
@ 2024-10-08 17:38 ` Peter Zijlstra
2024-10-08 19:44 ` Waiman Long
2024-10-09 8:26 ` Juri Lelli
2024-10-08 18:30 ` Waiman Long
2 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2024-10-08 17:38 UTC (permalink / raw)
To: Juri Lelli
Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
André Almeida, LKML, linux-rt-users, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
On Tue, Oct 08, 2024 at 04:22:26PM +0100, Juri Lelli wrote:
> Does this report make any sense? If it does, has this issue ever been
> reported and possibly discussed? I guess it’s kind of a corner case, but
> I wonder if anybody has suggestions already on how to possibly try to
> tackle it from a kernel perspective.
Any shared lock can cause such havoc. The futex hash bucket lock is just
one of a number of very popular ones that's relatively easy to hit.
I do have some futex-numa patches still pending, but they won't
magically cure this either. Userspace needs help at the very least.
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 17:38 ` Peter Zijlstra
@ 2024-10-08 19:44 ` Waiman Long
2024-10-09 7:22 ` Peter Zijlstra
2024-10-09 8:26 ` Juri Lelli
1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2024-10-08 19:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
André Almeida, LKML, linux-rt-users, Valentin Schneider,
Sebastian Andrzej Siewior, Juri Lelli
On 10/8/24 1:38 PM, Peter Zijlstra wrote:
> On Tue, Oct 08, 2024 at 04:22:26PM +0100, Juri Lelli wrote:
>> Does this report make any sense? If it does, has this issue ever been
>> reported and possibly discussed? I guess it’s kind of a corner case, but
>> I wonder if anybody has suggestions already on how to possibly try to
>> tackle it from a kernel perspective.
> Any shared lock can cause such havoc. The futex hash bucket lock is just
> one of a number of very popular ones that's relatively easy to hit.
>
> I do have some futex-numa patches still pending, but they won't
> magically cure this either. Userspace needs help at the very least.
Regarding the futex-numa patches, are you planning to get them merged
soon? We have customers asking for that.
Cheers,
Longman
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 19:44 ` Waiman Long
@ 2024-10-09 7:22 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2024-10-09 7:22 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
André Almeida, LKML, linux-rt-users, Valentin Schneider,
Sebastian Andrzej Siewior, Juri Lelli
On Tue, Oct 08, 2024 at 03:44:23PM -0400, Waiman Long wrote:
>
> On 10/8/24 1:38 PM, Peter Zijlstra wrote:
> > On Tue, Oct 08, 2024 at 04:22:26PM +0100, Juri Lelli wrote:
> > > Does this report make any sense? If it does, has this issue ever been
> > > reported and possibly discussed? I guess it’s kind of a corner case, but
> > > I wonder if anybody has suggestions already on how to possibly try to
> > > tackle it from a kernel perspective.
> > Any shared lock can cause such havoc. The futex hash bucket lock is just
> > one of a number of very popular ones that's relatively easy to hit.
> >
> > I do have some futex-numa patches still pending, but they won't
> > magically cure this either. Userspace needs help at the very least.
>
> Regarding the futex-numa patches, are you planning to get them merged soon?
> We have customers asking for that.
They're on the todo list somewhere... I know Ampere had interest, other
than that nobody really said anything.
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 17:38 ` Peter Zijlstra
2024-10-08 19:44 ` Waiman Long
@ 2024-10-09 8:26 ` Juri Lelli
1 sibling, 0 replies; 13+ messages in thread
From: Juri Lelli @ 2024-10-09 8:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
André Almeida, LKML, linux-rt-users, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Hi Peter,
On 08/10/24 19:38, Peter Zijlstra wrote:
> On Tue, Oct 08, 2024 at 04:22:26PM +0100, Juri Lelli wrote:
> > Does this report make any sense? If it does, has this issue ever been
> > reported and possibly discussed? I guess it’s kind of a corner case, but
> > I wonder if anybody has suggestions already on how to possibly try to
> > tackle it from a kernel perspective.
>
> Any shared lock can cause such havoc. The futex hash bucket lock is just
> one of a number of very popular ones that's relatively easy to hit.
Ah yes indeed. Just thought that if we have ideas on how to possibly
make this better it might still be worthwhile, even if it won't fix all
issues.
> I do have some futex-numa patches still pending, but they won't
> magically cure this either. Userspace needs help at the very least.
Thanks!
Juri
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 15:22 Futex hash_bucket lock can break isolation and cause priority inversion on RT Juri Lelli
2024-10-08 15:38 ` André Almeida
2024-10-08 17:38 ` Peter Zijlstra
@ 2024-10-08 18:30 ` Waiman Long
2024-10-09 8:28 ` Juri Lelli
2 siblings, 1 reply; 13+ messages in thread
From: Waiman Long @ 2024-10-08 18:30 UTC (permalink / raw)
To: Juri Lelli, Thomas Gleixner, Ingo Molnar
Cc: Peter Zijlstra, Darren Hart, Davidlohr Bueso, André Almeida,
LKML, linux-rt-users, Valentin Schneider,
Sebastian Andrzej Siewior
On 10/8/24 11:22 AM, Juri Lelli wrote:
> Hello,
>
> A report concerning latency sensitive applications using futexes on a
> PREEMPT_RT kernel brought me to (try to!) refresh my understanding of
> how futexes are implemented. The following is an attempt to make sense
> of what I am seeing from traces, validate that it indeed might make
> sense and possibly collect ideas on how to address the issue at hand.
>
> Simplifying what is actually a quite complicated setup composed of
> non-realtime (i.e., background load mostly related to a containers
> orchestrator) and realtime tasks, we can consider the following
> situation:
>
> - Multiprocessor system running a PREEMPT_RT kernel
> - Housekeeping CPUs (usually 2) running background tasks + “isolated”
> CPUs running latency sensitive tasks (possibly need to run also
> non-realtime activities at times)
> - CPUs are isolated dynamically by using nohz_full/rcu_nocbs options
> and affinity, no static scheduler isolation is used (i.e., no
> isolcpus=domain)
> - Threaded IRQs, RCU related kthreads, timers, etc. are configured with
> the highest priorities on the system (FIFO)
> - Latency sensitive application threads run at FIFO priority below the
> set of tasks from the former point
> - Latency sensitive application uses futexes, but they protect data
> only shared among tasks running on the isolated set of CPUs
> - Tasks running on housekeeping CPUs also use futexes
> - Futexes belonging to the above two sets of non interacting tasks are
> distinct
>
> Under these conditions the actual issue presents itself when:
>
> - A background task on a housekeeping CPUs enters sys_futex syscall and
> locks a hb->lock (PI enabled mutex on RT)
> - That background task gets preempted by a higher priority task (e.g.
> NIC irq thread)
> - A low latency application task on an isolated CPU also enters
> sys_futex, hash collision towards the background task hb, tries to
> grab hb->lock and, even if it boosts the background task, it still
> needs to wait for the higher priority task (NIC irq) to finish
> executing on the housekeeping CPU and eventually misses its deadline
>
> Now, of course by making the latency sensitive application tasks use a
> higher priority than anything on housekeeping CPUs we could avoid the
> issue, but the fact that an implicit in-kernel link between otherwise
> unrelated tasks might cause priority inversion is probably not ideal?
> Thus this email.
>
> Does this report make any sense? If it does, has this issue ever been
> reported and possibly discussed? I guess it’s kind of a corner case, but
> I wonder if anybody has suggestions already on how to possibly try to
> tackle it from a kernel perspective.
Just a question: is the low latency application using PI futexes or
normal wait-wake futexes? We could use a separate set of hash buckets
for these distinct futex types.
Cheers,
Longman
* Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
2024-10-08 18:30 ` Waiman Long
@ 2024-10-09 8:28 ` Juri Lelli
0 siblings, 0 replies; 13+ messages in thread
From: Juri Lelli @ 2024-10-09 8:28 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, André Almeida, LKML, linux-rt-users,
Valentin Schneider, Sebastian Andrzej Siewior
Hi Waiman,
On 08/10/24 14:30, Waiman Long wrote:
> On 10/8/24 11:22 AM, Juri Lelli wrote:
...
> > Now, of course by making the latency sensitive application tasks use a
> > higher priority than anything on housekeeping CPUs we could avoid the
> > issue, but the fact that an implicit in-kernel link between otherwise
> > unrelated tasks might cause priority inversion is probably not ideal?
> > Thus this email.
> >
> > Does this report make any sense? If it does, has this issue ever been
> > reported and possibly discussed? I guess it’s kind of a corner case, but
> > I wonder if anybody has suggestions already on how to possibly try to
> > tackle it from a kernel perspective.
>
> Just a question. Is the low latency application using PI futex or the normal
> wait-wake futex? We could use separate set of hash buckets for these
> distinct futex types.
AFAIK it uses normal futexes (or a mix at best). Also I believe it
relies on libraries, so somewhat difficult to tell for certain.
Thanks,
Juri