From: Tejun Heo <tj@kernel.org>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Thomas Graf <tgraf@suug.ch>, David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
linux-crypto@vger.kernel.org, sched-ext@lists.linux.dev,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option
Date: Thu, 16 Apr 2026 21:38:52 -1000 [thread overview]
Message-ID: <aeHjjGEhlikSsxCX@slm.duckdns.org> (raw)
In-Reply-To: <aeGIsGi9fBqu9EZT@gondor.apana.org.au>
Hello,
On Fri, Apr 17, 2026 at 09:11:12AM +0800, Herbert Xu wrote:
> On Thu, Apr 16, 2026 at 02:53:41PM -1000, Tejun Heo wrote:
> >
> > Oops, that's a mistake. I meant GFP_ATOMIC kmalloc allocation. kmalloc uses
> > regular spin_lock so can't be called under raw_spin_lock. There's the new
> > kmalloc_nolock() but that means even smaller reserve size, so higher chance
> > of failing. I'm not sure it can even accommodate larger allocations.
>
> We should at least try to grow even if it fails.
>
> > Another aspect is that for some use cases, it's more problematic to fail
> > insertion than delaying hash table resize (e.g. that can lead to fork
> > failures on a thrashing system).
>
> rhashtable is meant to grow way before you reach 100% occupancy.
> So the fact that you even reached this point means that something
> has gone terribly wrong.
>
> Is this failure reproducible?
>
> I had a look at kernel/sched/ext.c and all the calls to rhashtable
> insertion come from thread context. So why does it even use a
> raw spinlock?
Yeah, the DSQ conversion is incorrect. Sorry about that. In the current
master branch, there's another rhashtable - scx_sched_hash - which is
inserted into under scx_sched_lock, a raw spinlock. This is a bit difficult
to trigger because each insertion involves loading a sub-scheduler, which
usually takes quite some time, and the whole operation is serialized.

In a devel branch, I'm adding a unique thread ID to the task rhashtable -
scx_tid_hash:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git

which is initialized and populated when the root scheduler loads. As all
existing tasks are inserted during init, the hashtable expands rapidly and
it's not difficult to make it hit the synchronous growth path:
- Build the kernel and boot it. Build the example schedulers:

    make -C tools/sched_ext

- Create a lot of dormant threads:

    stress-ng --pthread 1 --pthread-max 10000 --timeout 60s

- Run the qmap scheduler, which enables tid hashing:

    tools/sched_ext/build/bin/scx_qmap

and the following lockdep warning triggers:
[ 34.344355] sched_ext: BPF scheduler "qmap" enabled
[ 35.999783] sched_ext: BPF scheduler "qmap" disabled (unregistered from user space)
[ 92.996704]
[ 92.996947] =============================
[ 92.997462] [ BUG: Invalid wait context ]
[ 92.997974] 7.0.0-work-08607-g4e2e9e22ddbe-dirty #423 Not tainted
[ 92.998759] -----------------------------
[ 92.999285] scx_enable_help/1907 is trying to lock:
[ 92.999903] ffff888237aa9450 (batched_entropy_u32.lock){..-.}-{3:3}, at: get_random_u32+0x40/0x270
[ 93.001047] other info that might help us debug this:
[ 93.001716] context-{5:5}
[ 93.002057] 6 locks held by scx_enable_help/1907:
[ 93.002677] #0: ffffffff83a90d48 (scx_enable_mutex){+.+.}-{4:4}, at: scx_root_enable_workfn+0x2b/0xad0
[ 93.003872] #1: ffffffff83a90e40 (scx_fork_rwsem){++++}-{0:0}, at: scx_root_enable_workfn+0x407/0xad0
[ 93.005053] #2: ffffffff83a90f80 (scx_cgroup_ops_rwsem){++++}-{0:0}, at: scx_cgroup_lock+0x11/0x20
[ 93.006380] #3: ffffffff83c3de28 (cgroup_mutex){+.+.}-{4:4}, at: scx_root_enable_workfn+0x436/0xad0
[ 93.007535] #4: ffffffff83a90e88 (scx_tasks_lock){....}-{2:2}, at: scx_root_enable_workfn+0x511/0xad0
[ 93.008757] #5: ffffffff83b77c98 (rcu_read_lock){....}-{1:3}, at: rhashtable_insert_slow+0x65/0x9e0
[ 93.009909] stack backtrace:
[ 93.010298] CPU: 2 UID: 0 PID: 1907 Comm: scx_enable_help Not tainted 7.0.0-work-08607-g4e2e9e22ddbe-dirty #423 PREEMPT(full)
[ 93.010300] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022
[ 93.010301] Sched_ext: qmap (enabling)
[ 93.010301] Call Trace:
[ 93.010304] <TASK>
[ 93.010305] dump_stack_lvl+0x50/0x70
[ 93.010307] __lock_acquire+0x9e8/0x2760
[ 93.010314] lock_acquire+0xb2/0x240
[ 93.010317] get_random_u32+0x58/0x270
[ 93.010321] bucket_table_alloc+0xa6/0x190
[ 93.010322] rhashtable_insert_slow+0x8ac/0x9e0
[ 93.010325] scx_root_enable_workfn+0x893/0xad0
[ 93.010328] kthread_worker_fn+0x108/0x300
[ 93.010330] kthread+0xfa/0x120
[ 93.010332] ret_from_fork+0x175/0x320
[ 93.010334] ret_from_fork_asm+0x1a/0x30
[ 93.010336] </TASK>
Now, I can move the insertion out of the raw scx_tasks_lock; however, that
complicates the init path, which then has to account for possible races
against the task exit path.
Also, taking a step back: if rhashtable allows usage under raw spinlocks,
isn't this broken regardless of how easy or difficult the problem is to
reproduce? Practically speaking, the scx_sched_hash case is unlikely to
trigger in the real world; however, it is still theoretically possible, and
I'm pretty positive one could construct a repro with the right interfering
workload. It'd be contrived for sure, but it should be possible.
Thanks.
--
tejun
Thread overview: 14+ messages
2026-04-17 0:24 [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option Tejun Heo
2026-04-17 0:24 ` [PATCH for-7.1-fixes 2/2] sched_ext: mark scx_sched_hash and dsq_hash no_sync_grow Tejun Heo
2026-04-17 0:43 ` [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option Herbert Xu
2026-04-17 0:53 ` Tejun Heo
2026-04-17 1:11 ` Herbert Xu
2026-04-17 7:38 ` Tejun Heo [this message]
2026-04-17 7:51 ` Herbert Xu
2026-04-17 16:25 ` Tejun Heo
2026-04-18 0:44 ` Herbert Xu
2026-04-18 0:52 ` Tejun Heo
2026-04-18 0:53 ` Herbert Xu
2026-04-18 1:38 ` [PATCH] rhashtable: Restore insecure_elasticity toggle Herbert Xu
2026-04-18 1:41 ` [v2 PATCH] " Herbert Xu
2026-04-17 1:22 ` [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option Herbert Xu