From: Thomas Gleixner <tglx@linutronix.de>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
linux-kernel@vger.kernel.org
Cc: "André Almeida" <andrealmeid@igalia.com>,
"Darren Hart" <dvhart@infradead.org>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Ingo Molnar" <mingo@redhat.com>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Valentin Schneider" <vschneid@redhat.com>,
"Waiman Long" <longman@redhat.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
Date: Tue, 10 Dec 2024 23:27:32 +0100
Message-ID: <8734ivcgx7.ffs@tglx>
In-Reply-To: <20241203164335.1125381-7-bigeasy@linutronix.de>
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> +static void futex_put_old_hb_p(struct futex_hash_bucket_private *hb_p)
> +{
> +    unsigned int slots = hb_p->hash_mask + 1;
> +    struct futex_hash_bucket *hb;
> +    DEFINE_WAKE_Q(wake_q);
> +    unsigned int i;
> +
> +    for (i = 0; i < slots; i++) {
> +        struct futex_q *this;
> +
> +        hb = &hb_p->queues[i];
> +
> +        spin_lock(&hb->lock);
> +        plist_for_each_entry(this, &hb->chain, list)
> +            wake_q_add(&wake_q, this->task);
> +        spin_unlock(&hb->lock);
> +    }
> +    futex_hash_priv_put(hb_p);
> +
> +    wake_up_q(&wake_q);
> +}
So you wake up all queued waiters and let them requeue themselves on the
new hash.

How is that safe against the following situation:
    CPU 0                                   CPU 1

    hb_p_old = mm->futex_hash_bucket;       hbp = mm->futex_hash_bucket;
    mm->futex_hash_bucket = new;
                                            // Reference count succeeds!
                                            rcuref_get(&hbp->refcnt);
    futex_put_old_hb_p();
                                            // Queues on old hash and
                                            // is lost forever
                                            queue(hbp);
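To make the window concrete, here is a tiny single-threaded C model of
exactly that interleaving. This is not kernel code; all names (struct
hash, demo_lost_waiter(), ...) are made up for illustration. It executes
the CPU 0 / CPU 1 steps in the order shown above and ends with the
waiter queued on a hash that has already been drained:

```c
#include <assert.h>
#include <stdbool.h>

/* A "hash" is reduced to a queue count plus a flag recording that the
 * drain/wake pass (futex_put_old_hb_p() above) has already run. */
struct hash {
    int queued;   /* waiters enqueued on this hash          */
    bool woken;   /* the one-shot wakeup pass already ran   */
};

static struct hash old_hash, new_hash;
static struct hash *mm_hash = &old_hash;

/* CPU 1, step 1: snapshot the hash pointer (reference count succeeds). */
static struct hash *cpu1_lookup(void) { return mm_hash; }

/* CPU 0: switch to the new hash, then wake everything on the old one. */
static void cpu0_rehash(void)
{
    struct hash *old = mm_hash;

    mm_hash = &new_hash;
    old->woken = true;      /* models futex_put_old_hb_p() */
}

/* CPU 1, step 2: queue on the snapshot taken earlier. */
static void cpu1_queue(struct hash *hb) { hb->queued++; }

int demo_lost_waiter(void)
{
    struct hash *snap = cpu1_lookup();  /* CPU 1 reads old pointer     */

    cpu0_rehash();                      /* CPU 0 swaps hash, wakes old */
    cpu1_queue(snap);                   /* CPU 1 queues on old hash    */

    /* The waiter now sits on a hash nobody will ever wake again. */
    return snap == &old_hash && old_hash.queued == 1 && old_hash.woken;
}
```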
This scheme simply cannot work unless you take the rwsem for read in the
wait path, which is a horrible idea. Aside of that, taking the rwsem in
various code paths is counterproductive. For a big process whose threads
operate heavily on different locks, you introduce a 'global'
serialization for no good reason.
I still think that the basic principle for any of this is to hold the
reference count only across the actual hash bucket related operations
and not to keep it across schedule().
That obviously means that the hash bucket pointer is invalid after such
an operation block and needs re-evaluation if needed again, but that's
not the end of the world.
Let's look at wait() again:
wait()
{
retry:
    hb = hb_get_and_lock(key);      // Acquires a reference
    ret = futex_get_value_locked();
    if (ret) {
        hb_unlock_and_put(hb);      // Drops the reference
        ...
        if (...)
            goto retry;
    }

    queue(hb, q, ...);
    hb_unlock_and_put(hb);          // Drops the reference
    schedule();
    unqueue(q);                     // Does not require hb
}
Why does unqueue() work w/o a hash bucket reference?
unqueue(q)
{
retry:
    lock_ptr = READ_ONCE(q->lock_ptr);
    // Woken up?
    if (!lock_ptr)
        return 0;

    spin_lock(lock_ptr);
    // This covers both requeue and rehash operations
    if (lock_ptr != q->lock_ptr) {
        spin_unlock(lock_ptr);
        goto retry;
    }
    __unqueue(q);
    spin_unlock(lock_ptr);
}
Nothing in unqueue() requires a reference on the hash. The lock pointer
logic covers both requeue and rehash operations. They are equivalent,
no?
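For illustration, a self-contained userspace sketch of that lock_ptr
revalidation scheme. The toy lock and all names are invented and a real
implementation uses the hash bucket spinlocks; the point is only that
redirecting lock_ptr under both bucket locks makes the revalidation loop
in unqueue() work for rehash exactly as it does for requeue:

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: this sketch is single-threaded, so "locking" just asserts
 * that the lock discipline is sane. */
struct lock { int held; };

static void lock_acquire(struct lock *l) { assert(!l->held); l->held = 1; }
static void lock_release(struct lock *l) { assert(l->held);  l->held = 0; }

struct bucket {
    struct lock lock;
    int nr_queued;
};

struct entry {
    struct lock *lock_ptr;      /* NULL once woken/unqueued */
    struct bucket *on;
};

/* Rehash: move the entry to the new bucket and redirect lock_ptr while
 * holding both bucket locks, as the rehash() sketch in the mail does. */
static void rehash_one(struct entry *e, struct bucket *from, struct bucket *to)
{
    lock_acquire(&from->lock);
    lock_acquire(&to->lock);
    from->nr_queued--;
    to->nr_queued++;
    e->on = to;
    e->lock_ptr = &to->lock;    /* unqueue() revalidates against this */
    lock_release(&to->lock);
    lock_release(&from->lock);
}

/* Unqueue: the lock_ptr revalidation loop. */
static int unqueue(struct entry *e)
{
    struct lock *lp;

retry:
    lp = e->lock_ptr;
    if (!lp)                    /* already woken */
        return 0;
    lock_acquire(lp);
    if (lp != e->lock_ptr) {    /* requeued or rehashed meanwhile */
        lock_release(lp);
        goto retry;
    }
    e->on->nr_queued--;
    e->lock_ptr = NULL;
    lock_release(lp);
    return 1;
}
```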
wake() is not really different. It needs to change the way the private
retry works:
wake_op()
{
retry:
    get_key(key1);
    get_key(key2);
retry_private:
    double_get_and_lock(&hb1, &hb2, &key1, &key2);
    .....
    double_unlock_and_put(&hb1, &hb2);
    .....
}
Moving the retry_private label before the point where the hash bucket is
retrieved and locked is required in some other places too. Some places
use q.lock_ptr under the assumption that it can't change, which probably
needs re-evaluation of the hash bucket. Other stuff like lock_pi() needs
a separation of unlocking the hash bucket from dropping the reference.
But those are all minor changes.
All of them can be done on a per function basis before adding the actual
private hash muck, which makes the whole thing reviewable. This patch
definitely does not qualify as reviewable.
All you need are implementations of hb_get_and_lock()/hb_unlock_and_put()
plus the double variants and a hash_put() helper. Those implementations
use the global hash until all places are mopped up, and then you can add
the private magic in exactly those places.
There is not a single place where you need magic state fixups in the
middle of the functions or conditional locking, which turns out not to
be sufficient anyway.
The required helpers are:
hb_get_and_lock(key)
{
    if (private(key))
        hb = private_hash(key);     // Gets a reference
    else
        hb = hash_bucket(global_hash, key);
    hb_lock(hb);
    return hb;
}

hb_unlock_and_put(hb)
{
    hb_unlock(hb);
    if (private(hb))
        hb_private_put(hb);
}
The double lock/unlock variants are equivalent.
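As a sketch of why the double variants do not need anything special to
avoid deadlock: take the two bucket locks in address order, and lock
once when both keys hash to the same bucket. The names below are
illustrative, not the kernel API, and the toy lock only asserts the
discipline in a single-threaded run:

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock: acquiring an already held lock is the deadlock we want to
 * rule out, so just assert against it. */
struct hb { struct { int held; } lock; };

static void hb_lock(struct hb *h)   { assert(!h->lock.held); h->lock.held = 1; }
static void hb_unlock(struct hb *h) { assert(h->lock.held);  h->lock.held = 0; }

/* Lock both buckets in address order, so two tasks locking the same
 * pair with opposite argument order cannot deadlock. */
static void double_lock(struct hb *a, struct hb *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        struct hb *t = a; a = b; b = t;
    }
    hb_lock(a);
    if (a != b)
        hb_lock(b);
}

static void double_unlock(struct hb *a, struct hb *b)
{
    hb_unlock(a);
    if (a != b)
        hb_unlock(b);
}
```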
private_hash(key)
{
    scoped_guard(rcu) {
        hash = rcu_deref(current->mm->futex.hash);
        if (rcuref_get(&hash->ref))
            return hash_bucket(hash, key);
    }

    guard(mutex)(current->mm->futex.hash_mutex);
    // Did reallocation win the race for the mutex?
    hash = current->mm->futex.hash;
    if (!rcuref_get(&hash->ref)) {
        hash = realloc_or_restore_hash();
        rcuref_get(&hash->ref);
    }
    return hash_bucket(hash, key);
}
hb_private_put(hb)
{
    hash = priv_hash(hb);
    hash_putref(hash);
}
hash_putref(hash)
{
    // Has fork dropped the initial reference to signal
    // that reallocation is required?
    //
    // If so, the last user hits the dead state
    if (!rcuref_put(&hash->ref))
        return;

    guard(mutex)(current->mm->futex.hash_mutex);
    realloc_or_restore_hash();
}
realloc_or_restore_hash()
{
    old_hash = current->mm->futex.hash;
    new_hash = alloc_hash(current->mm->users);
    if (!new_hash) {
        // Make the old hash alive again
        rcuref_init(&old_hash->ref, 1);
        return old_hash;
    }

    rehash(new_hash, old_hash);
    rcu_assign_pointer(current->mm->futex.hash, new_hash);
    rcu_free(old_hash);
    return new_hash;
}
rehash(new_hash, old_hash)
{
    // old_hash is marked dead, so new waiters cannot queue
    // themselves and are stuck on the hash_mutex.
    for (i = 0; i < old_hash->size; i++) {
        hb1 = &old_hash->queues[i];

        // Protect the old bucket against unqueue(), as it does
        // not try to get a reference on the hash bucket. It
        // solely relies on q->lock_ptr.
        spin_lock(&hb1->lock);
        plist_for_each_entry_safe(q, tmp, &hb1->chain, list) {
            hb2 = hash_bucket(new_hash, &q->key);

            // Nesting is safe as this is a one time operation
            spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
            plist_del(&q->list, &hb1->chain);
            // Redirect the lock pointer to the new hash
            // bucket. See unqueue().
            q->lock_ptr = &hb2->lock;
            plist_add(&q->list, &hb2->chain);
            spin_unlock(&hb2->lock);
        }
        spin_unlock(&hb1->lock);
    }
}
fork()
{
    if (hash_needs_resize())
        hash_putref(mm->futex.hash);
}
That should just work unless I'm missing something important. The charm
of utilizing rcuref for this is that there is close to zero impact on
the hotpaths, unless there is actually a reallocation in progress, which
is a rare event and applications can work around that by allocating the
appropriate hash size upfront.
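The lifecycle can be modeled with a plain counter where zero means dead,
as a stand-in for rcuref's dead state (all names below are invented and
this is not the kernel API): fork() drops the hash's initial reference
to request a resize, and whichever user drops the last reference
performs the reallocation, so the get/put hotpath stays untouched until
a resize is actually pending:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for rcuref: ref == 0 means dead, gets on a dead hash
 * fail, and the put that hits zero is told so. */
struct fhash {
    int ref;        /* 0 == dead, resize pending */
    int size;
};

static struct fhash small_hash = { .ref = 1, .size = 4 };
static struct fhash big_hash   = { .ref = 0, .size = 16 };
static struct fhash *mm_hash = &small_hash;
static int reallocs;

/* Get fails once the hash is dead (models rcuref_get()). */
static bool fget(struct fhash *h)
{
    return h->ref ? (h->ref++, true) : false;
}

static void realloc_hash(void)
{
    big_hash.ref = 1;       /* new hash starts with the initial ref */
    mm_hash = &big_hash;
    reallocs++;
}

/* The put that drops the last reference does the reallocation. */
static void fput(struct fhash *h)
{
    if (--h->ref == 0 && h == mm_hash)
        realloc_hash();
}

/* fork() only drops the initial reference to request the resize. */
static void fork_hook(void)
{
    fput(mm_hash);
}
```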
Thanks,
tglx