Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT

From: Thomas Gleixner <tglx@linutronix.de>
To: "Juri Lelli" <juri.lelli@redhat.com>,
	"André Almeida" <andrealmeid@igalia.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Darren Hart <dvhart@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-rt-users <linux-rt-users@vger.kernel.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Waiman Long <longman@redhat.com>
Subject: Re: Futex hash_bucket lock can break isolation and cause priority inversion on RT
Date: Fri, 25 Oct 2024 00:36:06 +0200
Message-ID: <875xph5dt5.ffs@tglx>
In-Reply-To: <ZwZAicOokVUn2h8h@jlelli-thinkpadt14gen4.remote.csb>

On Wed, Oct 09 2024 at 09:36, Juri Lelli wrote:
> On 08/10/24 12:59, André Almeida wrote:
>> > > There's this work from Thomas that aims to solve corner cases like this by
>> > > giving apps the option to have their own allocated wait queue instead of
>> > > using the global hash table:
>> > > https://lore.kernel.org/lkml/20160402095108.894519835@linutronix.de/
>> > > 
>> > > "Collisions on that hash can lead to performance degradation
>> > > and on real-time enabled kernels to unbound priority inversions."
>> > 
>> > This is correct. The problem is also that the hb lock is hashed on
>> > several things so if you restart/ reboot you may no longer share the hb
>> > lock with the "bad" application.
>> > 
>> > Now that I think about it, of all things we never tried a per-process
>> > (shared by threads) hb-lock which could also be hashed. This would avoid
>> > blocking on other applications, you would have to blame your own threads.
>
> Would this be somewhat similar to what Linus (and Ingo IIUC) were
> inclined to suggest in the thread above (edited)?
>
> ---
> So automatically using a local hashtable according to some heuristic is
> definitely the way to go. And yes, the heuristic may well be - at
> least to start - "this is a preempt-RT system" (for people who clearly
> care about having predictable latencies) or "this is actually a
> multi-node NUMA system, and I have heaps of memory"
> ---
>
> So, make it per-process local by default on PREEMPT_RT and NUMA?

I somehow did not have cycles to follow up on that proposal back then
and consequently forgot about it :(

To make this sane, the per-process hash has to be restricted to
process-private futexes. That's a reasonable restriction IMO and completely
avoids the global state dance which we implemented back then.
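
To make the isolation argument concrete, here is a minimal userspace
sketch. It is a toy model, not kernel code: toy_key, toy_mix() and the
table sizes are made up for illustration, and the mixer is not the hash
the kernel uses. It shows that with one global bucket array a second
process can always find an address that lands in the bucket of another
process' futex, while a per-process table identifies a bucket lock by
(mm, index), so keys from different processes can never share one.

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_BUCKETS 256u  /* toy size, not the kernel's */

/* Hypothetical stand-in for a private futex key: (process, address). */
struct toy_key {
    uint64_t mm;     /* identifies the process address space */
    uint64_t uaddr;  /* address of the futex word in that process */
};

/* Toy 64-bit mixer (splitmix64-style finalizer); NOT the kernel's hash. */
static uint64_t toy_mix(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Global table: every process hashes into the same bucket array. */
static unsigned int global_bucket(struct toy_key k)
{
    return (unsigned int)(toy_mix(k.mm ^ toy_mix(k.uaddr)) % GLOBAL_BUCKETS);
}

/*
 * Scan 4-byte aligned addresses in process 'other_mm' for one whose key
 * lands in the same global bucket as 'victim'. Returns 0 if none is
 * found within 'tries' candidates.
 */
static uint64_t find_colliding_uaddr(struct toy_key victim, uint64_t other_mm,
                                     unsigned int tries)
{
    unsigned int want = global_bucket(victim);

    for (unsigned int i = 1; i <= tries; i++) {
        struct toy_key k = { other_mm, 0x1000u + 4u * i };

        if (global_bucket(k) == want)
            return k.uaddr;
    }
    return 0;
}

/* Per-process table: a bucket lock is identified by (mm, index). */
struct toy_lock_id {
    uint64_t mm;
    unsigned int idx;
};

static struct toy_lock_id private_bucket(struct toy_key k, unsigned int nbuckets)
{
    struct toy_lock_id id = { k.mm, (unsigned int)(toy_mix(k.uaddr) % nbuckets) };

    return id;
}

/* Two keys contend on the same lock only if both mm and index match. */
static int same_lock(struct toy_lock_id a, struct toy_lock_id b)
{
    return a.mm == b.mm && a.idx == b.idx;
}
```

In this model collisions still happen inside a process when the private
table is small, but then the only threads you can block on are your own.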

I just dug up my old notes. Let me dump some thoughts.

1) The reason for the attachment syscall was to avoid latency on first
   usage, which can be far into the application lifetime because the
   kernel only learns about the futex when there is contention.

   For most scenarios this should be a non-issue because allocating a
   small hash table is usually not a problem, especially if you use a
   dedicated kmem_cache for it. Under memory pressure, that's a
   different issue, but an RT system should not get there in the first
   place.

   But for RT systems this might matter. Though we can be clever about
   it and allow preallocation of the per process hash table via a TBD
   sys_futex_init_private_hash() syscall or a prctl().

2) We aimed for zero collisions back then by making this an index-based
   mechanism. Though there was an open question of how to limit the maximum
   table size, and from my notes there was some insane number of entries
   required by some heavily threaded enterprise Java muck which used a
   gazillion futexes...

   We need some sane default/maximum sizing of the per-process hash
   table which can be adjusted by the sysadmin.

   Whether the proper mechanism is a syscall audit, which includes
   prctl(), or a UID/GID based rlimit does not matter much. That's a
   question for system admins/configurators to answer.
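
The two points above can be sketched together in plain userspace C.
This is only a model under stated assumptions: every name here
(toy_process, toy_init_private_hash(), the default and maximum sizes) is
hypothetical, the real interface is still TBD, and malloc()/calloc()
merely stand in for whatever allocator the kernel would use. The sketch
shows explicit preallocation, a lazy fallback on first use, and a
requested size clamped to an adjustable maximum and rounded down to a
power of two.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_DEFAULT_BUCKETS 16u    /* assumed default size */
#define TOY_MAX_BUCKETS     1024u  /* stand-in for an admin-set limit */

struct toy_private_hash {
    unsigned int nbuckets;
    int *buckets;                  /* placeholder for real bucket structs */
};

struct toy_process {
    struct toy_private_hash *hash; /* NULL until allocated */
};

/* Apply default and limit, then round down to a power of two. */
static unsigned int toy_clamp_size(unsigned int requested, unsigned int max)
{
    unsigned int n = 1;

    if (max == 0)
        max = 1;
    if (requested == 0)
        requested = TOY_DEFAULT_BUCKETS;
    if (requested > max)
        requested = max;
    while (n <= requested / 2)
        n *= 2;
    return n;
}

/*
 * Explicit preallocation, modelling the TBD syscall or prctl().
 * Returns 0 on success, -1 if a table already exists or allocation fails.
 */
static int toy_init_private_hash(struct toy_process *p, unsigned int requested,
                                 unsigned int max)
{
    struct toy_private_hash *h;

    if (p->hash)
        return -1;
    h = malloc(sizeof(*h));
    if (!h)
        return -1;
    h->nbuckets = toy_clamp_size(requested, max);
    h->buckets = calloc(h->nbuckets, sizeof(*h->buckets));
    if (!h->buckets) {
        free(h);
        return -1;
    }
    p->hash = h;
    return 0;
}

/*
 * Contended path: use the existing table, or allocate one lazily with
 * the default size. Returns 0 if the table already existed, 1 if this
 * call had to allocate it (the first-use latency), -1 on failure.
 */
static int toy_get_private_hash(struct toy_process *p, unsigned int max)
{
    if (p->hash)
        return 0;
    return toy_init_private_hash(p, 0, max) == 0 ? 1 : -1;
}
```

A latency-sensitive task would call the init function during setup, so
the first contended operation finds the table in place. Everything else
falls through to the lazy path and pays the allocation once.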

Hope that helps.

Thanks,

        tglx

