From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 23 Feb 2026 17:53:57 +0000
From: David Laight
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 mathieu.desnoyers@efficios.com, Mark Rutland, cmarinas@kernel.org,
 maddy@linux.ibm.com, hca@linux.ibm.com, ryan.roberts@arm.com
Subject: Re: [RFC] in-kernel rseq
Message-ID: <20260223175357.481c161e@pumpkin>
In-Reply-To: <20260223163843.GR1282955@noisy.programming.kicks-ass.net>
References: <20260223163843.GR1282955@noisy.programming.kicks-ass.net>
X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; arm-unknown-linux-gnueabihf)
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 23 Feb 2026 17:38:43 +0100
Peter Zijlstra wrote:

> Hi,
>
> It has come to my attention that various people are struggling with
> preempt_disable()+preempt_enable() costs for various architectures.
>
> Mostly in relation to things like this_cpu_ and/or local_.
>
> The below is a very crude (and broken, more on that below) POC.
>
> The 'main' advantage of this over preempt_disable()/preempt_enable()
> is on the preempt_enable() side: it elides the whole conditional and
> call schedule() nonsense.
>
> Now, on to the broken part: the below 'commit' address should be the
> address of the 'STORE' instruction. In case of LL/SC, it should be the
> SC; in case of LSE, it should be the LSE instruction.

I think it would be better as the address of the instruction after the
'store'.
You probably don't need separate 'begin' and 'restart' addresses.
It might be enough to save the 'restart' address and a byte length
directly in 'current' - much simpler code.
How much it helps is another matter.

I'm sure I remember something about per-cpu data being used for something
because it was faster than using 'current' - not sure of the context.

The real problem with rseq is that they don't scale.
At least this is against the context switch code - which is a slow path.

I wonder if anyone (not reading the code) would notice if a 'short term
preempt-disable' that missed out the reschedule test were implemented?
I think that is just an unlocked RMW of a per-cpu/thread variable.

	David

> This means, it needs to be woven into the asm... and I'm not that handy
> with arm64 asm.
>
> The pseudo code would be something like:
>
>	current->sched_rseq = &_R;
>	...
>
> _start:	compute per cpu-addr
>	load addr
>	$OP
> _commit:	store addr
>
>	...
>	current->sched_rseq = NULL;
>
>
> Then when preemption happens (from interrupt), the instruction pointer
> is 'simply' reset to _start and it tries again.
>
> Anyway, this was aimed at arm64, which chose to use atomics for
> this_cpu. But if we move sched_rseq() from schedule-tail into interrupt
> entry, then this would also work for things like Power.
>
> Anyway, just throwing ideas out there.
>
>
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index b57b2bb00967..080a868391b7 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -11,6 +11,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  static inline void set_my_cpu_offset(unsigned long off)
>  {
> @@ -155,9 +156,23 @@ PERCPU_RET_OP(add, add, ldadd)
>
>  #define _pcp_protect(op, pcp, ...)					\
>  ({									\
> -	preempt_disable_notrace();					\
> +	__label__ __rseq_begin;						\
> +	__label__ __rseq_end;						\
> +	static struct sched_rseq _R = {					\
> +		.begin = (unsigned long)&&__rseq_begin,			\
> +		.commit = (unsigned long)&&__rseq_end,			\
> +		.restart = (unsigned long)&&__rseq_begin,		\
> +	};								\
> +	struct sched_rseq **this_rseq;					\
> +	asm ("mrs %0, sp_el0; add %0, %0, %1;" : "=r" (this_rseq) : "i" (TSK_rseq));\
> +	*this_rseq = &_R;						\
> +__rseq_begin:								\
> +	barrier();							\
>  	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
> -	preempt_enable_notrace();					\
> +	/* XXX broken */						\
> +	barrier();							\
> +__rseq_end:								\
> +	*this_rseq = NULL;						\
>  })
>
>  #define _pcp_protect_return(op, pcp, args...)				\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a7b4a980eb2f..7960f3e21104 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -817,6 +817,12 @@ struct kmap_ctrl {
>  #endif
>  };
>
> +struct sched_rseq {
> +	unsigned long begin;
> +	unsigned long commit;
> +	unsigned long restart;
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -827,6 +833,8 @@ struct task_struct {
>  #endif
>  	unsigned int __state;
>
> +	struct sched_rseq *sched_rseq;
> +
>  	/* saved state for "spinlock sleepers" */
>  	unsigned int saved_state;
>
> diff --git a/include/linux/sched/rseq.h b/include/linux/sched/rseq.h
> deleted file mode 100644
> index e69de29bb2d1..000000000000
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bfd280ec0f97..d4702f8590f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5087,6 +5087,23 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	prepare_arch_switch(next);
>  }
>
> +static inline void sched_rseq(struct task_struct *prev)
> +{
> +	struct sched_rseq *rseq = prev->sched_rseq;
> +	struct pt_regs *regs;
> +	unsigned long ip;
> +
> +	if (likely(!rseq))
> +		return;
> +
> +	regs = task_pt_regs(prev);
> +	ip = instruction_pointer(regs);
> +	if ((ip - rseq->begin) >= (rseq->commit - rseq->begin))
> +		return;
> +
> +	instruction_pointer_set(regs, rseq->restart);
> +}
> +
>  /**
>   * finish_task_switch - clean up after a task-switch
>   * @prev: the thread we just switched away from.
> @@ -5145,6 +5162,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	prev_state = READ_ONCE(prev->__state);
>  	vtime_task_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> +	sched_rseq(prev);
>  	finish_task(prev);
>  	tick_nohz_task_switch();
>  	finish_lock_switch(rq);
> diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
> index a23747bbe25b..629989a89395 100644
> --- a/kernel/sched/rq-offsets.c
> +++ b/kernel/sched/rq-offsets.c
> @@ -6,7 +6,8 @@
>
>  int main(void)
>  {
> -	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
> +	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
> +	DEFINE(TSK_rseq, offsetof(struct task_struct, sched_rseq));
>
>  	return 0;
>  }
>