* [PATCH rcu 0/11] Add light-weight readers for SRCU
@ 2024-09-03 16:32 Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 01/11] srcu: Rename srcu_might_be_idle() to srcu_should_expedite() Paul E. McKenney
` (13 more replies)
0 siblings, 14 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:32 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
Hello!
This series provides light-weight readers for SRCU. This lightness
is selected by the caller by using the new srcu_read_lock_lite() and
srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
srcu_read_unlock() flavors. Although this passes significant rcutorture
testing, this should still be considered to be experimental.
There are a few restrictions: (1) If srcu_read_lock_lite() is called
on a given srcu_struct structure, then no other flavor may be used on
that srcu_struct structure, before, during, or after. (2) The _lite()
readers may only be invoked from regions of code where RCU is watching
(as in those regions in which rcu_is_watching() returns true). (3)
There is no auto-expediting for srcu_struct structures that have
been passed to _lite() readers. (4) SRCU grace periods for _lite()
srcu_struct structures invoke synchronize_rcu() at least twice, thus
having longer latencies than their non-_lite() counterparts. (5) Even
with synchronize_srcu_expedited(), the resulting SRCU grace period
will invoke synchronize_rcu() at least twice, as opposed to invoking
the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
from NMI handlers (that is what the _nmisafe() interfaces are for).
Although one could imagine readers that were both _lite() and _nmisafe(),
one might also imagine that the read-modify-write atomic operations that
are needed by any NMI-safe SRCU read marker would make this unhelpful
from a performance perspective.
All that said, the patches in this series are as follows:
1. Rename srcu_might_be_idle() to srcu_should_expedite().
2. Introduce srcu_gp_is_expedited() helper function.
3. Renaming in preparation for additional reader flavor.
4. Bit manipulation changes for additional reader flavor.
5. Standardize srcu_data pointers to "sdp" and similar.
6. Convert srcu_data ->srcu_reader_flavor to bit field.
7. Add srcu_read_lock_lite() and srcu_read_unlock_lite().
8. rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
9. rcutorture: Add reader_flavor parameter for SRCU readers.
10. rcutorture: Add srcu_read_lock_lite() support to
rcutorture.reader_flavor.
11. refscale: Add srcu_read_lock_lite() support using "srcu-lite".
Thanx, Paul
------------------------------------------------------------------------
Documentation/admin-guide/kernel-parameters.txt | 4
b/Documentation/admin-guide/kernel-parameters.txt | 8 +
b/include/linux/srcu.h | 21 +-
b/include/linux/srcutree.h | 2
b/kernel/rcu/rcutorture.c | 28 +--
b/kernel/rcu/refscale.c | 54 +++++--
b/kernel/rcu/srcutree.c | 16 +-
include/linux/srcu.h | 86 +++++++++--
include/linux/srcutree.h | 5
kernel/rcu/rcutorture.c | 37 +++-
kernel/rcu/srcutree.c | 168 +++++++++++++++-------
11 files changed, 308 insertions(+), 121 deletions(-)
* [PATCH rcu 01/11] srcu: Rename srcu_might_be_idle() to srcu_should_expedite()
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 02/11] srcu: Introduce srcu_gp_is_expedited() helper function Paul E. McKenney
` (12 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
SRCU auto-expedites grace periods that follow a sufficiently long idle
period, and the srcu_might_be_idle() function is used to make this
decision. However, the upcoming light-weight SRCU readers will not do
auto-expediting because doing so would cause the grace-period machinery
to invoke synchronize_rcu_expedited() twice, with IPIs all around.
Nevertheless, software-engineering considerations force this determination
to remain in srcu_might_be_idle().
This commit therefore changes the name of srcu_might_be_idle() to
srcu_should_expedite(), thus moving from what it currently does to why
it does it, this latter being more future-proof.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
kernel/rcu/srcutree.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 78afaffd1b262..2fe0abade9c06 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1139,7 +1139,8 @@ static void srcu_flip(struct srcu_struct *ssp)
}
/*
- * If SRCU is likely idle, return true, otherwise return false.
+ * If SRCU is likely idle, in other words, the next SRCU grace period
+ * should be expedited, return true, otherwise return false.
*
* Note that it is OK for several current from-idle requests for a new
* grace period from idle to specify expediting because they will all end
@@ -1159,7 +1160,7 @@ static void srcu_flip(struct srcu_struct *ssp)
* negligible when amortized over that time period, and the extra latency
* of a needlessly non-expedited grace period is similarly negligible.
*/
-static bool srcu_might_be_idle(struct srcu_struct *ssp)
+static bool srcu_should_expedite(struct srcu_struct *ssp)
{
unsigned long curseq;
unsigned long flags;
@@ -1469,14 +1470,15 @@ EXPORT_SYMBOL_GPL(synchronize_srcu_expedited);
* Implementation of these memory-ordering guarantees is similar to
* that of synchronize_rcu().
*
- * If SRCU is likely idle, expedite the first request. This semantic
- * was provided by Classic SRCU, and is relied upon by its users, so TREE
- * SRCU must also provide it. Note that detecting idleness is heuristic
- * and subject to both false positives and negatives.
+ * If SRCU is likely idle as determined by srcu_should_expedite(),
+ * expedite the first request. This semantic was provided by Classic SRCU,
+ * and is relied upon by its users, so TREE SRCU must also provide it.
+ * Note that detecting idleness is heuristic and subject to both false
+ * positives and negatives.
*/
void synchronize_srcu(struct srcu_struct *ssp)
{
- if (srcu_might_be_idle(ssp) || rcu_gp_is_expedited())
+ if (srcu_should_expedite(ssp) || rcu_gp_is_expedited())
synchronize_srcu_expedited(ssp);
else
__synchronize_srcu(ssp, true);
--
2.40.1
* [PATCH rcu 02/11] srcu: Introduce srcu_gp_is_expedited() helper function
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 01/11] srcu: Rename srcu_might_be_idle() to srcu_should_expedite() Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 03/11] srcu: Renaming in preparation for additional reader flavor Paul E. McKenney
` (11 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
Even though the open-coded expressions usually fit on one line, this
commit replaces them with a call to a new srcu_gp_is_expedited()
helper function in order to improve readability.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
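A reader's note, not part of the commit: the comparison wrapped by
srcu_gp_is_expedited() goes through ULONG_CMP_LT() rather than a plain "<"
because the grace-period sequence counters are free-running and may wrap.
The user-space sketch below illustrates this. It assumes the macro
definition found in include/linux/rcupdate.h; struct gp_seq_model,
model_gp_is_expedited(), and demo_gp_is_expedited() are hypothetical names
made up for this illustration and are not in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a is before b", assumed to match the kernel's definition. */
#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* Hypothetical stand-in for the two srcu_usage fields the helper reads. */
struct gp_seq_model {
	unsigned long gp_seq;            /* models ->srcu_gp_seq */
	unsigned long gp_seq_needed_exp; /* models ->srcu_gp_seq_needed_exp */
};

/* Same shape as the new helper, minus READ_ONCE(). */
bool model_gp_is_expedited(const struct gp_seq_model *m)
{
	return ULONG_CMP_LT(m->gp_seq, m->gp_seq_needed_exp);
}

/* Small usage demonstration. */
void demo_gp_is_expedited(void)
{
	struct gp_seq_model m = { .gp_seq = 8, .gp_seq_needed_exp = 12 };

	/* An expedited grace period beyond the current one was requested. */
	assert(model_gp_is_expedited(&m));

	/* The counter caught up, so nothing is left to expedite. */
	m.gp_seq = 12;
	assert(!model_gp_is_expedited(&m));

	/* The request wrapped past zero; a plain "<" gets this wrong. */
	m.gp_seq = ULONG_MAX - 3;
	m.gp_seq_needed_exp = 4;
	assert(model_gp_is_expedited(&m));
	assert(!(m.gp_seq < m.gp_seq_needed_exp));
}
```

The unsigned subtraction is what makes this work: the difference is
interpreted modulo the word size, so "before" is well defined as long as
the two counters are less than half the counter space apart.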
kernel/rcu/srcutree.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2fe0abade9c06..5b1a315f77bc6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -418,6 +418,16 @@ static void check_init_srcu_struct(struct srcu_struct *ssp)
spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);
}
+/*
+ * Is the current or any upcoming grace period to be expedited?
+ */
+static bool srcu_gp_is_expedited(struct srcu_struct *ssp)
+{
+ struct srcu_usage *sup = ssp->srcu_sup;
+
+ return ULONG_CMP_LT(READ_ONCE(sup->srcu_gp_seq), READ_ONCE(sup->srcu_gp_seq_needed_exp));
+}
+
/*
* Returns approximate total of the readers' ->srcu_lock_count[] values
* for the rank of per-CPU counters specified by idx.
@@ -622,7 +632,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
unsigned long jbase = SRCU_INTERVAL;
struct srcu_usage *sup = ssp->srcu_sup;
- if (ULONG_CMP_LT(READ_ONCE(sup->srcu_gp_seq), READ_ONCE(sup->srcu_gp_seq_needed_exp)))
+ if (srcu_gp_is_expedited(ssp))
jbase = 0;
if (rcu_seq_state(READ_ONCE(sup->srcu_gp_seq))) {
j = jiffies - 1;
@@ -867,7 +877,7 @@ static void srcu_gp_end(struct srcu_struct *ssp)
spin_lock_irq_rcu_node(sup);
idx = rcu_seq_state(sup->srcu_gp_seq);
WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
- if (ULONG_CMP_LT(READ_ONCE(sup->srcu_gp_seq), READ_ONCE(sup->srcu_gp_seq_needed_exp)))
+ if (srcu_gp_is_expedited(ssp))
cbdelay = 0;
WRITE_ONCE(sup->srcu_last_gp_end, ktime_get_mono_fast_ns());
--
2.40.1
* [PATCH rcu 03/11] srcu: Renaming in preparation for additional reader flavor
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 01/11] srcu: Rename srcu_might_be_idle() to srcu_should_expedite() Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 02/11] srcu: Introduce srcu_gp_is_expedited() helper function Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 04/11] srcu: Bit manipulation changes " Paul E. McKenney
` (10 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
Currently, there are only two flavors of readers, normal and NMI-safe.
A number of fields, functions, and types reflect this restriction.
This renaming-only commit prepares for the addition of light-weight
(as in memory-barrier-free) readers. OK, OK, there is also a drive-by
white-space fixup!
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
include/linux/srcu.h | 21 ++++++++++-----------
include/linux/srcutree.h | 2 +-
kernel/rcu/srcutree.c | 22 +++++++++++-----------
3 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 835bbb2d1f88a..06728ef6f32a4 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -181,10 +181,9 @@ static inline int srcu_read_lock_held(const struct srcu_struct *ssp)
#define SRCU_NMI_SAFE 0x2
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_TREE_SRCU)
-void srcu_check_nmi_safety(struct srcu_struct *ssp, bool nmi_safe);
+void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor);
#else
-static inline void srcu_check_nmi_safety(struct srcu_struct *ssp,
- bool nmi_safe) { }
+static inline void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor) { }
#endif
@@ -245,7 +244,7 @@ static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp)
{
int retval;
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
retval = __srcu_read_lock(ssp);
srcu_lock_acquire(&ssp->dep_map);
return retval;
@@ -262,7 +261,7 @@ static inline int srcu_read_lock_nmisafe(struct srcu_struct *ssp) __acquires(ssp
{
int retval;
- srcu_check_nmi_safety(ssp, true);
+ srcu_check_read_flavor(ssp, true);
retval = __srcu_read_lock_nmisafe(ssp);
rcu_try_lock_acquire(&ssp->dep_map);
return retval;
@@ -274,7 +273,7 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp)
{
int retval;
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
retval = __srcu_read_lock(ssp);
return retval;
}
@@ -303,7 +302,7 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp)
static inline int srcu_down_read(struct srcu_struct *ssp) __acquires(ssp)
{
WARN_ON_ONCE(in_nmi());
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
return __srcu_read_lock(ssp);
}
@@ -318,7 +317,7 @@ static inline void srcu_read_unlock(struct srcu_struct *ssp, int idx)
__releases(ssp)
{
WARN_ON_ONCE(idx & ~0x1);
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
srcu_lock_release(&ssp->dep_map);
__srcu_read_unlock(ssp, idx);
}
@@ -334,7 +333,7 @@ static inline void srcu_read_unlock_nmisafe(struct srcu_struct *ssp, int idx)
__releases(ssp)
{
WARN_ON_ONCE(idx & ~0x1);
- srcu_check_nmi_safety(ssp, true);
+ srcu_check_read_flavor(ssp, true);
rcu_lock_release(&ssp->dep_map);
__srcu_read_unlock_nmisafe(ssp, idx);
}
@@ -343,7 +342,7 @@ static inline void srcu_read_unlock_nmisafe(struct srcu_struct *ssp, int idx)
static inline notrace void
srcu_read_unlock_notrace(struct srcu_struct *ssp, int idx) __releases(ssp)
{
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
__srcu_read_unlock(ssp, idx);
}
@@ -360,7 +359,7 @@ static inline void srcu_up_read(struct srcu_struct *ssp, int idx)
{
WARN_ON_ONCE(idx & ~0x1);
WARN_ON_ONCE(in_nmi());
- srcu_check_nmi_safety(ssp, false);
+ srcu_check_read_flavor(ssp, false);
__srcu_read_unlock(ssp, idx);
}
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index ed57598394de3..ab7d8d215b84b 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -25,7 +25,7 @@ struct srcu_data {
/* Read-side state. */
atomic_long_t srcu_lock_count[2]; /* Locks per CPU. */
atomic_long_t srcu_unlock_count[2]; /* Unlocks per CPU. */
- int srcu_nmi_safety; /* NMI-safe srcu_struct structure? */
+ int srcu_reader_flavor; /* Reader flavor for srcu_struct structure? */
/* Update-side state. */
spinlock_t __private lock ____cacheline_internodealigned_in_smp;
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 5b1a315f77bc6..f259dd8342721 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -460,7 +460,7 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx)
sum += atomic_long_read(&cpuc->srcu_unlock_count[idx]);
if (IS_ENABLED(CONFIG_PROVE_RCU))
- mask = mask | READ_ONCE(cpuc->srcu_nmi_safety);
+ mask = mask | READ_ONCE(cpuc->srcu_reader_flavor);
}
WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask >> 1)),
"Mixed NMI-safe readers for srcu_struct at %ps.\n", ssp);
@@ -699,25 +699,25 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
#ifdef CONFIG_PROVE_RCU
/*
- * Check for consistent NMI safety.
+ * Check for consistent reader flavor.
*/
-void srcu_check_nmi_safety(struct srcu_struct *ssp, bool nmi_safe)
+void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor)
{
- int nmi_safe_mask = 1 << nmi_safe;
- int old_nmi_safe_mask;
+ int reader_flavor_mask = 1 << read_flavor;
+ int old_reader_flavor_mask;
struct srcu_data *sdp;
/* NMI-unsafe use in NMI is a bad sign */
- WARN_ON_ONCE(!nmi_safe && in_nmi());
+ WARN_ON_ONCE(!read_flavor && in_nmi());
sdp = raw_cpu_ptr(ssp->sda);
- old_nmi_safe_mask = READ_ONCE(sdp->srcu_nmi_safety);
- if (!old_nmi_safe_mask) {
- WRITE_ONCE(sdp->srcu_nmi_safety, nmi_safe_mask);
+ old_reader_flavor_mask = READ_ONCE(sdp->srcu_reader_flavor);
+ if (!old_reader_flavor_mask) {
+ WRITE_ONCE(sdp->srcu_reader_flavor, reader_flavor_mask);
return;
}
- WARN_ONCE(old_nmi_safe_mask != nmi_safe_mask, "CPU %d old state %d new state %d\n", sdp->cpu, old_nmi_safe_mask, nmi_safe_mask);
+ WARN_ONCE(old_reader_flavor_mask != reader_flavor_mask, "CPU %d old state %d new state %d\n", sdp->cpu, old_reader_flavor_mask, reader_flavor_mask);
}
-EXPORT_SYMBOL_GPL(srcu_check_nmi_safety);
+EXPORT_SYMBOL_GPL(srcu_check_read_flavor);
#endif /* CONFIG_PROVE_RCU */
/*
--
2.40.1
* [PATCH rcu 04/11] srcu: Bit manipulation changes for additional reader flavor
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (2 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 03/11] srcu: Renaming in preparation for additional reader flavor Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 05/11] srcu: Standardize srcu_data pointers to "sdp" and similar Paul E. McKenney
` (9 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
Currently, there are only two flavors of readers, normal and NMI-safe.
Very straightforward state updates suffice to check for erroneous
mixing of reader flavors on a given srcu_struct structure. This commit
upgrades the checking in preparation for the addition of light-weight
(as in memory-barrier-free) readers.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
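A reader's note, not part of the commit: the change from
"mask & (mask >> 1)" to "mask & (mask - 1)" matters once there are more
than two flavor bits. The user-space sketch below compares the two tests.
The 0x1 and 0x2 values follow from the "1 << read_flavor" encoding in the
previous patch; FLAVOR_THIRD and the function names are hypothetical and
chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAVOR_NORMAL 0x1u /* 1 << false */
#define FLAVOR_NMI    0x2u /* 1 << true */
#define FLAVOR_THIRD  0x4u /* hypothetical bit for a third flavor */

/* Old test: fires only when two adjacent bits are both set. */
bool mixed_old(unsigned int mask)
{
	return (mask & (mask >> 1)) != 0;
}

/* New test: fires whenever more than one bit is set. */
bool mixed_new(unsigned int mask)
{
	return (mask & (mask - 1)) != 0;
}

/* Small usage demonstration. */
void demo_mixed(void)
{
	/* With two flavors, both tests agree. */
	assert(mixed_old(FLAVOR_NORMAL | FLAVOR_NMI));
	assert(mixed_new(FLAVOR_NORMAL | FLAVOR_NMI));
	assert(!mixed_old(FLAVOR_NMI) && !mixed_new(FLAVOR_NMI));

	/* Bits 0 and 2 are not adjacent, so the old test misses the mix. */
	assert(!mixed_old(FLAVOR_NORMAL | FLAVOR_THIRD));
	assert(mixed_new(FLAVOR_NORMAL | FLAVOR_THIRD));

	/* No readers seen yet is not a mix under either test. */
	assert(!mixed_old(0) && !mixed_new(0));
}
```

Clearing the lowest set bit with "mask - 1" and checking whether anything
remains is the usual test for "more than one bit set", so it keeps working
no matter how many single-bit flavors are added later.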
kernel/rcu/srcutree.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index f259dd8342721..06ade8ca7804f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -462,7 +462,7 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx)
if (IS_ENABLED(CONFIG_PROVE_RCU))
mask = mask | READ_ONCE(cpuc->srcu_reader_flavor);
}
- WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask >> 1)),
+ WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask - 1)),
"Mixed NMI-safe readers for srcu_struct at %ps.\n", ssp);
return sum;
}
@@ -712,8 +712,10 @@ void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor)
sdp = raw_cpu_ptr(ssp->sda);
old_reader_flavor_mask = READ_ONCE(sdp->srcu_reader_flavor);
if (!old_reader_flavor_mask) {
- WRITE_ONCE(sdp->srcu_reader_flavor, reader_flavor_mask);
- return;
+ old_reader_flavor_mask = cmpxchg(&sdp->srcu_reader_flavor, 0, reader_flavor_mask);
+ if (!old_reader_flavor_mask) {
+ return;
+ }
}
WARN_ONCE(old_reader_flavor_mask != reader_flavor_mask, "CPU %d old state %d new state %d\n", sdp->cpu, old_reader_flavor_mask, reader_flavor_mask);
}
--
2.40.1
* [PATCH rcu 05/11] srcu: Standardize srcu_data pointers to "sdp" and similar
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (3 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 04/11] srcu: Bit manipulation changes " Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 06/11] srcu: Convert srcu_data ->srcu_reader_flavor to bit field Paul E. McKenney
` (8 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit changes a few "cpuc" variables to "sdp" to align with usage
elsewhere.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
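A reader's note, not part of the commit: the loops touched by this rename
all share one shape, namely walking the per-CPU srcu_data structures and
summing their counters. The user-space sketch below models the
srcu_readers_active() loop from the diff with plain arrays. NR_CPUS_MODEL,
struct sdp_counts, and the function names are hypothetical, and the
kernel's atomic_long_read() is replaced by ordinary loads.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4 /* hypothetical CPU count for the model */

/* Hypothetical stand-in for the read-side counters in struct srcu_data. */
struct sdp_counts {
	unsigned long lock_count[2];   /* models ->srcu_lock_count[] */
	unsigned long unlock_count[2]; /* models ->srcu_unlock_count[] */
};

/* True if locks and unlocks do not balance across all CPUs. */
bool model_readers_active(const struct sdp_counts *sda)
{
	unsigned long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		const struct sdp_counts *sdp = &sda[cpu];

		sum += sdp->lock_count[0];
		sum += sdp->lock_count[1];
		sum -= sdp->unlock_count[0];
		sum -= sdp->unlock_count[1];
	}
	return sum != 0;
}

/* Small usage demonstration. */
void demo_readers_active(void)
{
	struct sdp_counts sda[NR_CPUS_MODEL] = { 0 };

	assert(!model_readers_active(sda));

	/* A reader enters on CPU 2. */
	sda[2].lock_count[1]++;
	assert(model_readers_active(sda));

	/* The same reader exits on CPU 0; the running sum wraps, then balances. */
	sda[0].unlock_count[1]++;
	assert(!model_readers_active(sda));
}
```

Because the sum is unsigned, it does not matter that a reader may lock on
one CPU and unlock on another: the intermediate value may wrap, but the
total returns to zero once every lock has a matching unlock.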
kernel/rcu/srcutree.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 06ade8ca7804f..54ca2ea2b7d68 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -438,9 +438,9 @@ static unsigned long srcu_readers_lock_idx(struct srcu_struct *ssp, int idx)
unsigned long sum = 0;
for_each_possible_cpu(cpu) {
- struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu);
+ struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
- sum += atomic_long_read(&cpuc->srcu_lock_count[idx]);
+ sum += atomic_long_read(&sdp->srcu_lock_count[idx]);
}
return sum;
}
@@ -456,11 +456,11 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx)
unsigned long sum = 0;
for_each_possible_cpu(cpu) {
- struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu);
+ struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
- sum += atomic_long_read(&cpuc->srcu_unlock_count[idx]);
+ sum += atomic_long_read(&sdp->srcu_unlock_count[idx]);
if (IS_ENABLED(CONFIG_PROVE_RCU))
- mask = mask | READ_ONCE(cpuc->srcu_reader_flavor);
+ mask = mask | READ_ONCE(sdp->srcu_reader_flavor);
}
WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask - 1)),
"Mixed NMI-safe readers for srcu_struct at %ps.\n", ssp);
@@ -564,12 +564,12 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
unsigned long sum = 0;
for_each_possible_cpu(cpu) {
- struct srcu_data *cpuc = per_cpu_ptr(ssp->sda, cpu);
+ struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
- sum += atomic_long_read(&cpuc->srcu_lock_count[0]);
- sum += atomic_long_read(&cpuc->srcu_lock_count[1]);
- sum -= atomic_long_read(&cpuc->srcu_unlock_count[0]);
- sum -= atomic_long_read(&cpuc->srcu_unlock_count[1]);
+ sum += atomic_long_read(&sdp->srcu_lock_count[0]);
+ sum += atomic_long_read(&sdp->srcu_lock_count[1]);
+ sum -= atomic_long_read(&sdp->srcu_unlock_count[0]);
+ sum -= atomic_long_read(&sdp->srcu_unlock_count[1]);
}
return sum;
}
--
2.40.1
* [PATCH rcu 06/11] srcu: Convert srcu_data ->srcu_reader_flavor to bit field
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (4 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 05/11] srcu: Standardize srcu_data pointers to "sdp" and similar Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite() Paul E. McKenney
` (7 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit adds SRCU_READ_FLAVOR_NORMAL and SRCU_READ_FLAVOR_NMI bit
definitions and uses them in preparation for adding a third SRCU reader
flavor.
While in the area, improve a few comments.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
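A reader's note, not part of the commit: after this patch,
srcu_check_read_flavor() records the first flavor seen on a CPU and
complains if a later call presents a different one. The user-space sketch
below models that first-writer-wins logic with C11 atomics standing in for
the kernel's READ_ONCE() and cmpxchg(). struct sdp_flavor and the function
names are hypothetical, and a false return marks the spot where the kernel
would WARN_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAVOR_NORMAL 0x1 /* same value as SRCU_READ_FLAVOR_NORMAL */
#define FLAVOR_NMI    0x2 /* same value as SRCU_READ_FLAVOR_NMI */

/* Hypothetical stand-in for srcu_data; zero means no flavor recorded yet. */
struct sdp_flavor {
	atomic_int reader_flavor;
};

/* Returns true if read_flavor is consistent with what was recorded. */
bool model_check_read_flavor(struct sdp_flavor *sdp, int read_flavor)
{
	int old = atomic_load_explicit(&sdp->reader_flavor, memory_order_relaxed);

	if (!old) {
		/* Try to be the first recorder; on failure, old holds the winner. */
		if (atomic_compare_exchange_strong(&sdp->reader_flavor, &old, read_flavor))
			return true;
	}
	return old == read_flavor;
}

/* Small usage demonstration. */
void demo_check_read_flavor(void)
{
	struct sdp_flavor sdp;

	atomic_init(&sdp.reader_flavor, 0);

	/* The first use records the flavor. */
	assert(model_check_read_flavor(&sdp, FLAVOR_NORMAL));
	/* Repeated use of the same flavor is fine. */
	assert(model_check_read_flavor(&sdp, FLAVOR_NORMAL));
	/* A different flavor on the same structure is flagged. */
	assert(!model_check_read_flavor(&sdp, FLAVOR_NMI));
}
```

The compare-and-exchange matters only when two contexts race to record the
first flavor: exactly one of them wins, and the loser is then compared
against the winner's value instead of silently overwriting it.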
include/linux/srcu.h | 35 +++++++++++++++++++----------------
include/linux/srcutree.h | 4 ++++
kernel/rcu/srcutree.c | 22 +++++++++++-----------
3 files changed, 34 insertions(+), 27 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 06728ef6f32a4..84daaa33ea0ab 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -176,10 +176,6 @@ static inline int srcu_read_lock_held(const struct srcu_struct *ssp)
#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
-#define SRCU_NMI_UNKNOWN 0x0
-#define SRCU_NMI_UNSAFE 0x1
-#define SRCU_NMI_SAFE 0x2
-
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_TREE_SRCU)
void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor);
#else
@@ -235,16 +231,19 @@ static inline void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flav
* a mutex that is held elsewhere while calling synchronize_srcu() or
* synchronize_srcu_expedited().
*
- * Note that srcu_read_lock() and the matching srcu_read_unlock() must
- * occur in the same context, for example, it is illegal to invoke
- * srcu_read_unlock() in an irq handler if the matching srcu_read_lock()
- * was invoked in process context.
+ * The return value from srcu_read_lock() must be passed unaltered
+ * to the matching srcu_read_unlock(). Note that srcu_read_lock() and
+ * the matching srcu_read_unlock() must occur in the same context, for
+ * example, it is illegal to invoke srcu_read_unlock() in an irq handler
+ * if the matching srcu_read_lock() was invoked in process context. Or,
+ * for that matter to invoke srcu_read_unlock() from one task and the
+ * matching srcu_read_lock() from another.
*/
static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp)
{
int retval;
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
retval = __srcu_read_lock(ssp);
srcu_lock_acquire(&ssp->dep_map);
return retval;
@@ -256,12 +255,16 @@ static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp)
*
* Enter an SRCU read-side critical section, but in an NMI-safe manner.
* See srcu_read_lock() for more information.
+ *
+ * If srcu_read_lock_nmisafe() is ever used on an srcu_struct structure,
+ * then none of the other flavors may be used, whether before, during,
+ * or after.
*/
static inline int srcu_read_lock_nmisafe(struct srcu_struct *ssp) __acquires(ssp)
{
int retval;
- srcu_check_read_flavor(ssp, true);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NMI);
retval = __srcu_read_lock_nmisafe(ssp);
rcu_try_lock_acquire(&ssp->dep_map);
return retval;
@@ -273,7 +276,7 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp)
{
int retval;
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
retval = __srcu_read_lock(ssp);
return retval;
}
@@ -302,7 +305,7 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp)
static inline int srcu_down_read(struct srcu_struct *ssp) __acquires(ssp)
{
WARN_ON_ONCE(in_nmi());
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
return __srcu_read_lock(ssp);
}
@@ -317,7 +320,7 @@ static inline void srcu_read_unlock(struct srcu_struct *ssp, int idx)
__releases(ssp)
{
WARN_ON_ONCE(idx & ~0x1);
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
srcu_lock_release(&ssp->dep_map);
__srcu_read_unlock(ssp, idx);
}
@@ -333,7 +336,7 @@ static inline void srcu_read_unlock_nmisafe(struct srcu_struct *ssp, int idx)
__releases(ssp)
{
WARN_ON_ONCE(idx & ~0x1);
- srcu_check_read_flavor(ssp, true);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NMI);
rcu_lock_release(&ssp->dep_map);
__srcu_read_unlock_nmisafe(ssp, idx);
}
@@ -342,7 +345,7 @@ static inline void srcu_read_unlock_nmisafe(struct srcu_struct *ssp, int idx)
static inline notrace void
srcu_read_unlock_notrace(struct srcu_struct *ssp, int idx) __releases(ssp)
{
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
__srcu_read_unlock(ssp, idx);
}
@@ -359,7 +362,7 @@ static inline void srcu_up_read(struct srcu_struct *ssp, int idx)
{
WARN_ON_ONCE(idx & ~0x1);
WARN_ON_ONCE(in_nmi());
- srcu_check_read_flavor(ssp, false);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_NORMAL);
__srcu_read_unlock(ssp, idx);
}
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index ab7d8d215b84b..79ad809c7f035 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -43,6 +43,10 @@ struct srcu_data {
struct srcu_struct *ssp;
};
+/* Values for ->srcu_reader_flavor. */
+#define SRCU_READ_FLAVOR_NORMAL 0x1 // srcu_read_lock().
+#define SRCU_READ_FLAVOR_NMI 0x2 // srcu_read_lock_nmisafe().
+
/*
* Node in SRCU combining tree, similar in function to rcu_data.
*/
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 54ca2ea2b7d68..602b4b8c4b891 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -463,7 +463,7 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx)
mask = mask | READ_ONCE(sdp->srcu_reader_flavor);
}
WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask - 1)),
- "Mixed NMI-safe readers for srcu_struct at %ps.\n", ssp);
+ "Mixed reader flavors for srcu_struct at %ps.\n", ssp);
return sum;
}
@@ -703,21 +703,21 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
*/
void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor)
{
- int reader_flavor_mask = 1 << read_flavor;
- int old_reader_flavor_mask;
+ int old_read_flavor;
struct srcu_data *sdp;
- /* NMI-unsafe use in NMI is a bad sign */
- WARN_ON_ONCE(!read_flavor && in_nmi());
+ /* NMI-unsafe use in NMI is a bad sign, as is multi-bit read_flavor values. */
+ WARN_ON_ONCE((read_flavor != SRCU_READ_FLAVOR_NMI) && in_nmi());
+ WARN_ON_ONCE(read_flavor & (read_flavor - 1));
+
sdp = raw_cpu_ptr(ssp->sda);
- old_reader_flavor_mask = READ_ONCE(sdp->srcu_reader_flavor);
- if (!old_reader_flavor_mask) {
- old_reader_flavor_mask = cmpxchg(&sdp->srcu_reader_flavor, 0, reader_flavor_mask);
- if (!old_reader_flavor_mask) {
+ old_read_flavor = READ_ONCE(sdp->srcu_reader_flavor);
+ if (!old_read_flavor) {
+ old_read_flavor = cmpxchg(&sdp->srcu_reader_flavor, 0, read_flavor);
+ if (!old_read_flavor)
return;
- }
}
- WARN_ONCE(old_reader_flavor_mask != reader_flavor_mask, "CPU %d old state %d new state %d\n", sdp->cpu, old_reader_flavor_mask, reader_flavor_mask);
+ WARN_ONCE(old_read_flavor != read_flavor, "CPU %d old state %d new state %d\n", sdp->cpu, old_read_flavor, read_flavor);
}
EXPORT_SYMBOL_GPL(srcu_check_read_flavor);
#endif /* CONFIG_PROVE_RCU */
--
2.40.1
* [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (5 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 06/11] srcu: Convert srcu_data ->srcu_reader_flavor to bit field Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 19:45 ` Alexei Starovoitov
2024-09-03 16:33 ` [PATCH rcu 08/11] rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits Paul E. McKenney
` (6 subsequent siblings)
13 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This patch adds srcu_read_lock_lite() and srcu_read_unlock_lite(), which
dispense with the read-side smp_mb() but also are restricted to code
regions that RCU is watching. If a given srcu_struct structure uses
srcu_read_lock_lite() and srcu_read_unlock_lite(), it is not permitted
to use any other SRCU read-side marker, before, during, or after.
Another price of light-weight readers is heavier weight grace periods.
Such readers mean that SRCU grace periods on srcu_struct structures
used by light-weight readers will incur at least two calls to
synchronize_rcu(). In addition, normal SRCU grace periods for
light-weight-reader srcu_struct structures never auto-expedite.
Note that expedited SRCU grace periods for light-weight-reader
srcu_struct structures still invoke synchronize_rcu(), not
synchronize_rcu_expedited(). Something about wishing to keep
the IPIs down to a dull roar.
The srcu_read_lock_lite() and srcu_read_unlock_lite() functions may not
(repeat, *not*) be used from NMI handlers, but if this is needed, an
additional flavor of SRCU reader can be added by some future commit.
[ paulmck: Apply Alexei Starovoitov expediting feedback. ]
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
include/linux/srcu.h | 51 ++++++++++++++++++++++++-
include/linux/srcutree.h | 1 +
kernel/rcu/srcutree.c | 82 ++++++++++++++++++++++++++++++++++------
3 files changed, 122 insertions(+), 12 deletions(-)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 84daaa33ea0ab..4ba96e2cfa405 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -56,6 +56,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
void cleanup_srcu_struct(struct srcu_struct *ssp);
int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp);
void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
+#ifdef CONFIG_TINY_SRCU
+#define __srcu_read_lock_lite __srcu_read_lock
+#define __srcu_read_unlock_lite __srcu_read_unlock
+#else // #ifdef CONFIG_TINY_SRCU
+int __srcu_read_lock_lite(struct srcu_struct *ssp) __acquires(ssp);
+void __srcu_read_unlock_lite(struct srcu_struct *ssp, int idx) __releases(ssp);
+#endif // #else // #ifdef CONFIG_TINY_SRCU
void synchronize_srcu(struct srcu_struct *ssp);
#define SRCU_GET_STATE_COMPLETED 0x1
@@ -179,7 +186,7 @@ static inline int srcu_read_lock_held(const struct srcu_struct *ssp)
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_TREE_SRCU)
void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor);
#else
-static inline void srcu_check_read_flavor(struct srcu_struct *ssp, int read_flavor) { }
+#define srcu_check_read_flavor(ssp, read_flavor) do { } while (0)
#endif
@@ -249,6 +256,32 @@ static inline int srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp)
return retval;
}
+/**
+ * srcu_read_lock_lite - register a new reader for an SRCU-protected structure.
+ * @ssp: srcu_struct in which to register the new reader.
+ *
+ * Enter an SRCU read-side critical section, but for a light-weight
+ * smp_mb()-free reader. See srcu_read_lock() for more information.
+ *
+ * If srcu_read_lock_lite() is ever used on an srcu_struct structure,
+ * then none of the other flavors may be used, whether before, during,
+ * or after. Note that grace-period auto-expediting is disabled for _lite
+ * srcu_struct structures because auto-expedited grace periods invoke
+ * synchronize_rcu_expedited(), IPIs and all.
+ *
+ * Note that srcu_read_lock_lite() can be invoked only from those contexts
+ * where RCU is watching. Otherwise, lockdep will complain.
+ */
+static inline int srcu_read_lock_lite(struct srcu_struct *ssp) __acquires(ssp)
+{
+ int retval;
+
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_LITE);
+ retval = __srcu_read_lock_lite(ssp);
+ rcu_try_lock_acquire(&ssp->dep_map);
+ return retval;
+}
+
/**
* srcu_read_lock_nmisafe - register a new reader for an SRCU-protected structure.
* @ssp: srcu_struct in which to register the new reader.
@@ -325,6 +358,22 @@ static inline void srcu_read_unlock(struct srcu_struct *ssp, int idx)
__srcu_read_unlock(ssp, idx);
}
+/**
+ * srcu_read_unlock_lite - unregister an old reader from an SRCU-protected structure.
+ * @ssp: srcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding srcu_read_lock_lite().
+ *
+ * Exit a light-weight SRCU read-side critical section.
+ */
+static inline void srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
+ __releases(ssp)
+{
+ WARN_ON_ONCE(idx & ~0x1);
+ srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_LITE);
+ srcu_lock_release(&ssp->dep_map);
+ __srcu_read_unlock_lite(ssp, idx);
+}
+
/**
* srcu_read_unlock_nmisafe - unregister a old reader from an SRCU-protected structure.
* @ssp: srcu_struct in which to unregister the old reader.
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 79ad809c7f035..8074138cbd624 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -46,6 +46,7 @@ struct srcu_data {
/* Values for ->srcu_reader_flavor. */
#define SRCU_READ_FLAVOR_NORMAL 0x1 // srcu_read_lock().
#define SRCU_READ_FLAVOR_NMI 0x2 // srcu_read_lock_nmisafe().
+#define SRCU_READ_FLAVOR_LITE 0x4 // srcu_read_lock_lite().
/*
* Node in SRCU combining tree, similar in function to rcu_data.
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 602b4b8c4b891..bab888e86b9bb 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -429,20 +429,29 @@ static bool srcu_gp_is_expedited(struct srcu_struct *ssp)
}
/*
- * Returns approximate total of the readers' ->srcu_lock_count[] values
- * for the rank of per-CPU counters specified by idx.
+ * Computes approximate total of the readers' ->srcu_lock_count[] values
+ * for the rank of per-CPU counters specified by idx, and returns true if
+ * the caller did the proper barrier (gp), and if the count of the locks
+ * matches that of the unlocks passed in.
*/
-static unsigned long srcu_readers_lock_idx(struct srcu_struct *ssp, int idx)
+static bool srcu_readers_lock_idx(struct srcu_struct *ssp, int idx, bool gp, unsigned long unlocks)
{
int cpu;
+ unsigned long mask = 0;
unsigned long sum = 0;
for_each_possible_cpu(cpu) {
struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
sum += atomic_long_read(&sdp->srcu_lock_count[idx]);
+ if (IS_ENABLED(CONFIG_PROVE_RCU))
+ mask = mask | READ_ONCE(sdp->srcu_reader_flavor);
}
- return sum;
+ WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && (mask & (mask - 1)),
+ "Mixed reader flavors for srcu_struct at %ps.\n", ssp);
+ if (mask & SRCU_READ_FLAVOR_LITE && !gp)
+ return false;
+ return sum == unlocks;
}
/*
@@ -473,6 +482,7 @@ static unsigned long srcu_readers_unlock_idx(struct srcu_struct *ssp, int idx)
*/
static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
{
+ bool did_gp = !!(__this_cpu_read(ssp->sda->srcu_reader_flavor) & SRCU_READ_FLAVOR_LITE);
unsigned long unlocks;
unlocks = srcu_readers_unlock_idx(ssp, idx);
@@ -482,13 +492,16 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
* unlock is counted. Needs to be a smp_mb() as the read side may
* contain a read from a variable that is written to before the
* synchronize_srcu() in the write side. In this case smp_mb()s
- * A and B act like the store buffering pattern.
+ * A and B (or X and Y) act like the store buffering pattern.
*
- * This smp_mb() also pairs with smp_mb() C to prevent accesses
- * after the synchronize_srcu() from being executed before the
- * grace period ends.
+ * This smp_mb() also pairs with smp_mb() C (or, in the case of X,
+ * Z) to prevent accesses after the synchronize_srcu() from being
+ * executed before the grace period ends.
*/
- smp_mb(); /* A */
+ if (!did_gp)
+ smp_mb(); /* A */
+ else
+ synchronize_rcu(); /* X */
/*
* If the locks are the same as the unlocks, then there must have
@@ -546,7 +559,7 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
* which are unlikely to be configured with an address space fully
* populated with memory, at least not anytime soon.
*/
- return srcu_readers_lock_idx(ssp, idx) == unlocks;
+ return srcu_readers_lock_idx(ssp, idx, did_gp, unlocks);
}
/**
@@ -750,6 +763,47 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);
+/*
+ * Counts the new reader in the appropriate per-CPU element of the
+ * srcu_struct. Returns an index that must be passed to the matching
+ * srcu_read_unlock_lite().
+ *
+ * Note that this_cpu_inc() is an RCU read-side critical section either
+ * because it disables interrupts, because it is a single instruction,
+ * or because it is a read-modify-write atomic operation, depending on
+ * the whims of the architecture.
+ */
+int __srcu_read_lock_lite(struct srcu_struct *ssp)
+{
+ int idx;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
+ idx = READ_ONCE(ssp->srcu_idx) & 0x1;
+ this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); /* Y */
+ barrier(); /* Avoid leaking the critical section. */
+ return idx;
+}
+EXPORT_SYMBOL_GPL(__srcu_read_lock_lite);
+
+/*
+ * Removes the count for the old reader from the appropriate
+ * per-CPU element of the srcu_struct. Note that this may well be a
+ * different CPU than that which was incremented by the corresponding
+ * srcu_read_lock_lite(), but it must be within the same task.
+ *
+ * Note that this_cpu_inc() is an RCU read-side critical section either
+ * because it disables interrupts, because it is a single instruction,
+ * or because it is a read-modify-write atomic operation, depending on
+ * the whims of the architecture.
+ */
+void __srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
+{
+ barrier(); /* Avoid leaking the critical section. */
+ this_cpu_inc(ssp->sda->srcu_unlock_count[idx].counter); /* Z */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_unlock_lite().");
+}
+EXPORT_SYMBOL_GPL(__srcu_read_unlock_lite);
+
#ifdef CONFIG_NEED_SRCU_NMI_SAFE
/*
@@ -1134,6 +1188,8 @@ static void srcu_flip(struct srcu_struct *ssp)
* it stays until either (1) Compilers learn about this sort of
* control dependency or (2) Some production workload running on
* a production system is unduly delayed by this slowpath smp_mb().
+ * Except for _lite() readers, where it is inoperative, which
+ * means that it is a good thing that it is redundant.
*/
smp_mb(); /* E */ /* Pairs with B and C. */
@@ -1152,7 +1208,8 @@ static void srcu_flip(struct srcu_struct *ssp)
/*
* If SRCU is likely idle, in other words, the next SRCU grace period
- * should be expedited, return true, otherwise return false.
+ * should be expedited, return true, otherwise return false. Except that
+ * in the presence of _lite() readers, always return false.
*
* Note that it is OK for several current from-idle requests for a new
* grace period from idle to specify expediting because they will all end
@@ -1181,6 +1238,9 @@ static bool srcu_should_expedite(struct srcu_struct *ssp)
unsigned long tlast;
check_init_srcu_struct(ssp);
+ /* If _lite() readers, don't do unsolicited expediting. */
+ if (this_cpu_read(ssp->sda->srcu_reader_flavor) & SRCU_READ_FLAVOR_LITE)
+ return false;
/* If the local srcu_data structure has callbacks, not idle. */
sdp = raw_cpu_ptr(ssp->sda);
spin_lock_irqsave_rcu_node(sdp, flags);
--
2.40.1
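The counter scheme in this patch can be modeled in userspace. The sketch below is single-threaded and purely illustrative: it assumes a fixed MODEL_NR_CPUS, uses plain arrays in place of per-CPU data, and omits all ordering (smp_mb(), synchronize_rcu(), barrier()), which is the part that actually makes the kernel code correct. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for this model */

/* Per-CPU counters, indexed by the low bit of the grace-period index. */
struct model_srcu_data {
	unsigned long lock_count[2];
	unsigned long unlock_count[2];
};

struct model_srcu {
	int idx; /* plays the role of ->srcu_idx */
	struct model_srcu_data sda[MODEL_NR_CPUS];
};

/* Models __srcu_read_lock_lite(): count the reader, return its index. */
static int model_read_lock_lite(struct model_srcu *sp, int cpu)
{
	int idx = sp->idx & 0x1;

	sp->sda[cpu].lock_count[idx]++;
	return idx;
}

/* Models __srcu_read_unlock_lite(): may run on a different CPU. */
static void model_read_unlock_lite(struct model_srcu *sp, int cpu, int idx)
{
	sp->sda[cpu].unlock_count[idx]++;
}

/* Sum the unlocks first, then the locks; equal sums mean no readers. */
static bool model_readers_quiesced(const struct model_srcu *sp, int idx)
{
	unsigned long unlocks = 0;
	unsigned long locks = 0;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		unlocks += sp->sda[cpu].unlock_count[idx];
	/* The kernel runs smp_mb() (A) or synchronize_rcu() (X) here. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		locks += sp->sda[cpu].lock_count[idx];
	return locks == unlocks;
}
```

Because only the sums are compared, a reader that locks on one CPU and unlocks on another still balances out, which matches the "different CPU ... same task" wording in the patch's comment.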
* [PATCH rcu 08/11] rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (6 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite() Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 09/11] rcutorture: Add reader_flavor parameter for SRCU readers Paul E. McKenney
` (5 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit prepares for testing of multiple SRCU reader flavors by
expanding RCUTORTURE_RDR_MASK_1 and RCUTORTURE_RDR_MASK_2 from a single
bit to eight bits, allowing them to accommodate the return values from
multiple calls to srcu_read_lock*(). This will in turn permit better
testing coverage for these SRCU reader flavors, including testing of
the diagnostics for improper use of mixed reader flavors.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
kernel/rcu/rcutorture.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index b4cb7623a8bfc..d883f01407178 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -57,9 +57,9 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com> and Josh Triplett <josh@
/* Bits for ->extendables field, extendables param, and related definitions. */
#define RCUTORTURE_RDR_SHIFT_1 8 /* Put SRCU index in upper bits. */
-#define RCUTORTURE_RDR_MASK_1 (1 << RCUTORTURE_RDR_SHIFT_1)
-#define RCUTORTURE_RDR_SHIFT_2 9 /* Put SRCU index in upper bits. */
-#define RCUTORTURE_RDR_MASK_2 (1 << RCUTORTURE_RDR_SHIFT_2)
+#define RCUTORTURE_RDR_MASK_1 (0xff << RCUTORTURE_RDR_SHIFT_1)
+#define RCUTORTURE_RDR_SHIFT_2 16 /* Put SRCU index in upper bits. */
+#define RCUTORTURE_RDR_MASK_2 (0xff << RCUTORTURE_RDR_SHIFT_2)
#define RCUTORTURE_RDR_BH 0x01 /* Extend readers by disabling bh. */
#define RCUTORTURE_RDR_IRQ 0x02 /* ... disabling interrupts. */
#define RCUTORTURE_RDR_PREEMPT 0x04 /* ... disabling preemption. */
@@ -71,6 +71,9 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com> and Josh Triplett <josh@
#define RCUTORTURE_MAX_EXTEND \
(RCUTORTURE_RDR_BH | RCUTORTURE_RDR_IRQ | RCUTORTURE_RDR_PREEMPT | \
RCUTORTURE_RDR_RBH | RCUTORTURE_RDR_SCHED)
+#define RCUTORTURE_RDR_ALLBITS \
+ (RCUTORTURE_MAX_EXTEND | RCUTORTURE_RDR_RCU_1 | RCUTORTURE_RDR_RCU_2 | \
+ RCUTORTURE_RDR_MASK_1 | RCUTORTURE_RDR_MASK_2)
#define RCUTORTURE_RDR_MAX_LOOPS 0x7 /* Maximum reader extensions. */
/* Must be power of two minus one. */
#define RCUTORTURE_RDR_MAX_SEGS (RCUTORTURE_RDR_MAX_LOOPS + 3)
@@ -1830,7 +1833,7 @@ static void rcutorture_one_extend(int *readstate, int newstate,
int statesold = *readstate & ~newstate;
WARN_ON_ONCE(idxold2 < 0);
- WARN_ON_ONCE((idxold2 >> RCUTORTURE_RDR_SHIFT_2) > 1);
+ WARN_ON_ONCE(idxold2 & ~RCUTORTURE_RDR_ALLBITS);
rtrsp->rt_readstate = newstate;
/* First, put new protection in place to avoid critical-section gap. */
@@ -1845,9 +1848,9 @@ static void rcutorture_one_extend(int *readstate, int newstate,
if (statesnew & RCUTORTURE_RDR_SCHED)
rcu_read_lock_sched();
if (statesnew & RCUTORTURE_RDR_RCU_1)
- idxnew1 = (cur_ops->readlock() & 0x1) << RCUTORTURE_RDR_SHIFT_1;
+ idxnew1 = (cur_ops->readlock() << RCUTORTURE_RDR_SHIFT_1) & RCUTORTURE_RDR_MASK_1;
if (statesnew & RCUTORTURE_RDR_RCU_2)
- idxnew2 = (cur_ops->readlock() & 0x1) << RCUTORTURE_RDR_SHIFT_2;
+ idxnew2 = (cur_ops->readlock() << RCUTORTURE_RDR_SHIFT_2) & RCUTORTURE_RDR_MASK_2;
/*
* Next, remove old protection, in decreasing order of strength
@@ -1867,7 +1870,7 @@ static void rcutorture_one_extend(int *readstate, int newstate,
if (statesold & RCUTORTURE_RDR_RBH)
rcu_read_unlock_bh();
if (statesold & RCUTORTURE_RDR_RCU_2) {
- cur_ops->readunlock((idxold2 >> RCUTORTURE_RDR_SHIFT_2) & 0x1);
+ cur_ops->readunlock((idxold2 & RCUTORTURE_RDR_MASK_2) >> RCUTORTURE_RDR_SHIFT_2);
WARN_ON_ONCE(idxnew2 != -1);
idxold2 = 0;
}
@@ -1877,7 +1880,7 @@ static void rcutorture_one_extend(int *readstate, int newstate,
lockit = !cur_ops->no_pi_lock && !statesnew && !(torture_random(trsp) & 0xffff);
if (lockit)
raw_spin_lock_irqsave(&current->pi_lock, flags);
- cur_ops->readunlock((idxold1 >> RCUTORTURE_RDR_SHIFT_1) & 0x1);
+ cur_ops->readunlock((idxold1 & RCUTORTURE_RDR_MASK_1) >> RCUTORTURE_RDR_SHIFT_1);
WARN_ON_ONCE(idxnew1 != -1);
idxold1 = 0;
if (lockit)
@@ -1892,16 +1895,13 @@ static void rcutorture_one_extend(int *readstate, int newstate,
if (idxnew1 == -1)
idxnew1 = idxold1 & RCUTORTURE_RDR_MASK_1;
WARN_ON_ONCE(idxnew1 < 0);
- if (WARN_ON_ONCE((idxnew1 >> RCUTORTURE_RDR_SHIFT_1) > 1))
- pr_info("Unexpected idxnew1 value of %#x\n", idxnew1);
if (idxnew2 == -1)
idxnew2 = idxold2 & RCUTORTURE_RDR_MASK_2;
WARN_ON_ONCE(idxnew2 < 0);
- WARN_ON_ONCE((idxnew2 >> RCUTORTURE_RDR_SHIFT_2) > 1);
*readstate = idxnew1 | idxnew2 | newstate;
WARN_ON_ONCE(*readstate < 0);
- if (WARN_ON_ONCE((*readstate >> RCUTORTURE_RDR_SHIFT_2) > 1))
- pr_info("Unexpected idxnew2 value of %#x\n", idxnew2);
+ if (WARN_ON_ONCE(*readstate & ~RCUTORTURE_RDR_ALLBITS))
+ pr_info("Unexpected readstate value of %#x\n", *readstate);
}
/* Return the biggest extendables mask given current RCU and boot parameters. */
@@ -1926,7 +1926,7 @@ rcutorture_extend_mask(int oldmask, struct torture_random_state *trsp)
unsigned long preempts_irq = preempts | RCUTORTURE_RDR_IRQ;
unsigned long bhs = RCUTORTURE_RDR_BH | RCUTORTURE_RDR_RBH;
- WARN_ON_ONCE(mask >> RCUTORTURE_RDR_SHIFT_1);
+ WARN_ON_ONCE(mask >> RCUTORTURE_RDR_SHIFT_1); // Can't have reader idx bits.
/* Mostly only one bit (need preemption!), sometimes lots of bits. */
if (!(randmask1 & 0x7))
mask = mask & randmask2;
--
2.40.1
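The widened fields in this patch amount to packing two eight-bit reader indexes above the low-order extension bits. A small userspace sketch of the pack and unpack steps follows, using the shift and mask values from the hunk above. The rdr_pack_*() and rdr_unpack_*() helpers are hypothetical names introduced here for illustration; rcutorture open-codes these expressions.

```c
#include <assert.h>

/* Values taken from the rcutorture.c hunk in this patch. */
#define RCUTORTURE_RDR_SHIFT_1 8
#define RCUTORTURE_RDR_MASK_1  (0xff << RCUTORTURE_RDR_SHIFT_1)
#define RCUTORTURE_RDR_SHIFT_2 16
#define RCUTORTURE_RDR_MASK_2  (0xff << RCUTORTURE_RDR_SHIFT_2)

/* Hypothetical helpers: place a reader index into field 1 or field 2. */
static int rdr_pack_1(int idx)
{
	return (idx << RCUTORTURE_RDR_SHIFT_1) & RCUTORTURE_RDR_MASK_1;
}

static int rdr_pack_2(int idx)
{
	return (idx << RCUTORTURE_RDR_SHIFT_2) & RCUTORTURE_RDR_MASK_2;
}

/* Hypothetical helpers: recover the reader index from the read state. */
static int rdr_unpack_1(int readstate)
{
	return (readstate & RCUTORTURE_RDR_MASK_1) >> RCUTORTURE_RDR_SHIFT_1;
}

static int rdr_unpack_2(int readstate)
{
	return (readstate & RCUTORTURE_RDR_MASK_2) >> RCUTORTURE_RDR_SHIFT_2;
}
```

Masking before shifting on the unpack side keeps field 2 from leaking into field 1, which the old single-bit "& 0x1" form handled implicitly.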
* [PATCH rcu 09/11] rcutorture: Add reader_flavor parameter for SRCU readers
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (7 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 08/11] rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 10/11] rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavor Paul E. McKenney
` (4 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit adds an rcutorture.reader_flavor parameter whose bits
correspond to reader flavors. For example, SRCU's readers are 0x1 for
normal and 0x2 for NMI-safe.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
.../admin-guide/kernel-parameters.txt | 8 +++++
kernel/rcu/rcutorture.c | 30 ++++++++++++++-----
2 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f8f0800852585..e107c82f0b21b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5352,6 +5352,14 @@
The delay, in seconds, between successive
read-then-exit testing episodes.
+ rcutorture.reader_flavor= [KNL]
+ A bit mask indicating which readers to use.
+ If there is more than one bit set, the readers
+ are entered from low-order bit up, and are
+ exited in the opposite order. For SRCU, the
+ 0x1 bit is normal readers and the 0x2 bit is
+ for NMI-safe readers.
+
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index d883f01407178..1a3e0fdca7139 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -111,6 +111,7 @@ torture_param(int, nocbs_nthreads, 0, "Number of NOCB toggle threads, 0 to disab
torture_param(int, nocbs_toggle, 1000, "Time between toggling nocb state (ms)");
torture_param(int, read_exit_delay, 13, "Delay between read-then-exit episodes (s)");
torture_param(int, read_exit_burst, 16, "# of read-then-exit bursts per episode, zero to disable");
+torture_param(int, reader_flavor, 0x1, "Reader flavors to use, one per bit.");
torture_param(int, shuffle_interval, 3, "Number of seconds between shuffles");
torture_param(int, shutdown_secs, 0, "Shutdown time (s), <= zero to disable.");
torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
@@ -646,10 +647,20 @@ static void srcu_get_gp_data(int *flags, unsigned long *gp_seq)
static int srcu_torture_read_lock(void)
{
- if (cur_ops == &srcud_ops)
- return srcu_read_lock_nmisafe(srcu_ctlp);
- else
- return srcu_read_lock(srcu_ctlp);
+ int idx;
+ int ret = 0;
+
+ if ((reader_flavor & 0x1) || !(reader_flavor & 0x7)) {
+ idx = srcu_read_lock(srcu_ctlp);
+ WARN_ON_ONCE(idx & ~0x1);
+ ret += idx;
+ }
+ if (reader_flavor & 0x2) {
+ idx = srcu_read_lock_nmisafe(srcu_ctlp);
+ WARN_ON_ONCE(idx & ~0x1);
+ ret += idx << 1;
+ }
+ return ret;
}
static void
@@ -673,10 +684,11 @@ srcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp)
static void srcu_torture_read_unlock(int idx)
{
- if (cur_ops == &srcud_ops)
- srcu_read_unlock_nmisafe(srcu_ctlp, idx);
- else
- srcu_read_unlock(srcu_ctlp, idx);
+ WARN_ON_ONCE((reader_flavor && (idx & ~reader_flavor)) || (!reader_flavor && (idx & ~0x1)));
+ if (reader_flavor & 0x2)
+ srcu_read_unlock_nmisafe(srcu_ctlp, (idx & 0x2) >> 1);
+ if ((reader_flavor & 0x1) || !(reader_flavor & 0x7))
+ srcu_read_unlock(srcu_ctlp, idx & 0x1);
}
static int torture_srcu_read_lock_held(void)
@@ -2399,6 +2411,7 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag)
"n_barrier_cbs=%d "
"onoff_interval=%d onoff_holdoff=%d "
"read_exit_delay=%d read_exit_burst=%d "
+ "reader_flavor=%x "
"nocbs_nthreads=%d nocbs_toggle=%d "
"test_nmis=%d\n",
torture_type, tag, nrealreaders, nfakewriters,
@@ -2411,6 +2424,7 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag)
n_barrier_cbs,
onoff_interval, onoff_holdoff,
read_exit_delay, read_exit_burst,
+ reader_flavor,
nocbs_nthreads, nocbs_toggle,
test_nmis);
}
--
2.40.1
* [PATCH rcu 10/11] rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavor
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (8 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 09/11] rcutorture: Add reader_flavor parameter for SRCU readers Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 11/11] refscale: Add srcu_read_lock_lite() support using "srcu-lite" Paul E. McKenney
` (3 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit causes bit 0x4 of rcutorture.reader_flavor to select the new
srcu_read_lock_lite() and srcu_read_unlock_lite() functions.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
Documentation/admin-guide/kernel-parameters.txt | 4 ++--
kernel/rcu/rcutorture.c | 7 +++++++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e107c82f0b21b..39bf8ce17e992 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5357,8 +5357,8 @@
If there is more than one bit set, the readers
are entered from low-order bit up, and are
exited in the opposite order. For SRCU, the
- 0x1 bit is normal readers and the 0x2 bit is
- for NMI-safe readers.
+ 0x1 bit is normal readers, 0x2 NMI-safe readers,
+ and 0x4 light-weight readers.
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 1a3e0fdca7139..306a449bbad87 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -660,6 +660,11 @@ static int srcu_torture_read_lock(void)
WARN_ON_ONCE(idx & ~0x1);
ret += idx << 1;
}
+ if (reader_flavor & 0x4) {
+ idx = srcu_read_lock_lite(srcu_ctlp);
+ WARN_ON_ONCE(idx & ~0x1);
+ ret += idx << 2;
+ }
return ret;
}
@@ -685,6 +690,8 @@ srcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp)
static void srcu_torture_read_unlock(int idx)
{
WARN_ON_ONCE((reader_flavor && (idx & ~reader_flavor)) || (!reader_flavor && (idx & ~0x1)));
+ if (reader_flavor & 0x4)
+ srcu_read_unlock_lite(srcu_ctlp, (idx & 0x4) >> 2);
if (reader_flavor & 0x2)
srcu_read_unlock_nmisafe(srcu_ctlp, (idx & 0x2) >> 1);
if ((reader_flavor & 0x1) || !(reader_flavor & 0x7))
--
2.40.1
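The cookie returned by srcu_torture_read_lock() after patches 09 and 10 carries one index bit per flavor. The userspace sketch below mirrors that encoding with stub lock and unlock functions in place of the real SRCU calls. The stub_* and model_* names are hypothetical. Note that mixing flavors on one srcu_struct is exactly what patch 07 prohibits, so a multi-bit reader_flavor is presumably useful mainly for exercising the mixed-flavor diagnostics.

```c
#include <assert.h>

/* Hypothetical stubs standing in for the three SRCU reader flavors. */
static int stub_idx[3];      /* index each stub lock returns (0 or 1) */
static int stub_unlocked[3]; /* index each stub unlock last received */

static int stub_lock(int flavor)
{
	return stub_idx[flavor];
}

static void stub_unlock(int flavor, int idx)
{
	stub_unlocked[flavor] = idx;
}

/* Enter readers from the low-order bit up, one cookie bit per flavor. */
static int model_read_lock(int reader_flavor)
{
	int ret = 0;

	if ((reader_flavor & 0x1) || !(reader_flavor & 0x7))
		ret += stub_lock(0);      /* normal, also the default */
	if (reader_flavor & 0x2)
		ret += stub_lock(1) << 1; /* NMI-safe */
	if (reader_flavor & 0x4)
		ret += stub_lock(2) << 2; /* light-weight */
	return ret;
}

/* Exit readers in the opposite order, handing each its own index bit. */
static void model_read_unlock(int reader_flavor, int idx)
{
	if (reader_flavor & 0x4)
		stub_unlock(2, (idx & 0x4) >> 2);
	if (reader_flavor & 0x2)
		stub_unlock(1, (idx & 0x2) >> 1);
	if ((reader_flavor & 0x1) || !(reader_flavor & 0x7))
		stub_unlock(0, idx & 0x1);
}
```

With reader_flavor set to zero, neither of the low three bits is set, so the model falls back to the normal flavor, matching the "!(reader_flavor & 0x7)" test in the patch.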
* [PATCH rcu 11/11] refscale: Add srcu_read_lock_lite() support using "srcu-lite"
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (9 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 10/11] rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavor Paul E. McKenney
@ 2024-09-03 16:33 ` Paul E. McKenney
2024-09-03 21:38 ` [PATCH rcu 0/11] Add light-weight readers for SRCU Kent Overstreet
` (2 subsequent siblings)
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 16:33 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf,
Paul E. McKenney
This commit creates a new srcu-lite option for the refscale.scale_type
module parameter that selects srcu_read_lock_lite() and
srcu_read_unlock_lite().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
---
kernel/rcu/refscale.c | 54 ++++++++++++++++++++++++++++++++-----------
1 file changed, 41 insertions(+), 13 deletions(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index be66e5a67ee19..9a392822cd5a7 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -216,6 +216,36 @@ static const struct ref_scale_ops srcu_ops = {
.name = "srcu"
};
+static void srcu_lite_ref_scale_read_section(const int nloops)
+{
+ int i;
+ int idx;
+
+ for (i = nloops; i >= 0; i--) {
+ idx = srcu_read_lock_lite(srcu_ctlp);
+ srcu_read_unlock_lite(srcu_ctlp, idx);
+ }
+}
+
+static void srcu_lite_ref_scale_delay_section(const int nloops, const int udl, const int ndl)
+{
+ int i;
+ int idx;
+
+ for (i = nloops; i >= 0; i--) {
+ idx = srcu_read_lock_lite(srcu_ctlp);
+ un_delay(udl, ndl);
+ srcu_read_unlock_lite(srcu_ctlp, idx);
+ }
+}
+
+static const struct ref_scale_ops srcu_lite_ops = {
+ .init = rcu_sync_scale_init,
+ .readsection = srcu_lite_ref_scale_read_section,
+ .delaysection = srcu_lite_ref_scale_delay_section,
+ .name = "srcu-lite"
+};
+
#ifdef CONFIG_TASKS_RCU
// Definitions for RCU Tasks ref scale testing: Empty read markers.
@@ -1133,27 +1163,25 @@ ref_scale_init(void)
long i;
int firsterr = 0;
static const struct ref_scale_ops *scale_ops[] = {
- &rcu_ops, &srcu_ops, RCU_TRACE_OPS RCU_TASKS_OPS &refcnt_ops, &rwlock_ops,
- &rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops, &sched_clock_ops, &clock_ops,
- &jiffies_ops, &typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops,
+ &rcu_ops, &srcu_ops, &srcu_lite_ops, RCU_TRACE_OPS RCU_TASKS_OPS
+ &refcnt_ops, &rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops,
+ &sched_clock_ops, &clock_ops, &jiffies_ops, &typesafe_ref_ops, &typesafe_lock_ops,
+ &typesafe_seqlock_ops,
};
if (!torture_init_begin(scale_type, verbose))
return -EBUSY;
for (i = 0; i < ARRAY_SIZE(scale_ops); i++) {
- cur_ops = scale_ops[i];
- if (strcmp(scale_type, cur_ops->name) == 0)
+ cur_ops = scale_ops[i]; if (strcmp(scale_type,
+ cur_ops->name) == 0)
break;
- }
- if (i == ARRAY_SIZE(scale_ops)) {
- pr_alert("rcu-scale: invalid scale type: \"%s\"\n", scale_type);
- pr_alert("rcu-scale types:");
- for (i = 0; i < ARRAY_SIZE(scale_ops); i++)
+ } if (i == ARRAY_SIZE(scale_ops)) {
+ pr_alert("rcu-scale: invalid scale type: \"%s\"\n",
+ scale_type); pr_alert("rcu-scale types:"); for (i = 0;
+ i < ARRAY_SIZE(scale_ops); i++)
pr_cont(" %s", scale_ops[i]->name);
- pr_cont("\n");
- firsterr = -EINVAL;
- cur_ops = NULL;
+ pr_cont("\n"); firsterr = -EINVAL; cur_ops = NULL;
goto unwind;
}
if (cur_ops->init)
--
2.40.1
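The refscale change follows the usual pattern of an ops table selected by name. A minimal userspace sketch of that lookup follows, assuming a cut-down ops structure with only the fields needed here. The model_* names are hypothetical, and the read sections merely count iterations rather than taking any lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct ref_scale_ops. */
struct model_scale_ops {
	void (*readsection)(const int nloops);
	const char *name;
};

static int model_pairs; /* counts simulated lock/unlock pairs */

/* Same loop shape as the patch: runs nloops + 1 iterations. */
static void model_srcu_lite_read_section(const int nloops)
{
	int i;

	for (i = nloops; i >= 0; i--)
		model_pairs++; /* stands in for lock_lite/unlock_lite */
}

static void model_srcu_read_section(const int nloops)
{
	int i;

	for (i = nloops; i >= 0; i--)
		model_pairs++; /* stands in for lock/unlock */
}

static const struct model_scale_ops model_ops[] = {
	{ .readsection = model_srcu_read_section, .name = "srcu" },
	{ .readsection = model_srcu_lite_read_section, .name = "srcu-lite" },
};

/* Return the ops whose name matches scale_type, or NULL if none does. */
static const struct model_scale_ops *model_find_ops(const char *scale_type)
{
	size_t i;

	for (i = 0; i < sizeof(model_ops) / sizeof(model_ops[0]); i++)
		if (strcmp(scale_type, model_ops[i].name) == 0)
			return &model_ops[i];
	return NULL;
}
```

In the kernel, the name would come from the refscale.scale_type module parameter, with "srcu-lite" selecting the new ops added by this patch.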
* Re: [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()
2024-09-03 16:33 ` [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite() Paul E. McKenney
@ 2024-09-03 19:45 ` Alexei Starovoitov
2024-09-03 22:10 ` Paul E. McKenney
0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2024-09-03 19:45 UTC (permalink / raw)
To: Paul E. McKenney
Cc: rcu, LKML, Kernel Team, Steven Rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Tue, Sep 3, 2024 at 9:33 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 84daaa33ea0ab..4ba96e2cfa405 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
...
> +static inline int srcu_read_lock_lite(struct srcu_struct *ssp) __acquires(ssp)
> +{
> + int retval;
> +
> + srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_LITE);
> + retval = __srcu_read_lock_lite(ssp);
> + rcu_try_lock_acquire(&ssp->dep_map);
> + return retval;
> +}
...
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 602b4b8c4b891..bab888e86b9bb 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> +int __srcu_read_lock_lite(struct srcu_struct *ssp)
> +{
> + int idx;
> +
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
> + idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> + this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); /* Y */
> + barrier(); /* Avoid leaking the critical section. */
> + return idx;
> +}
> +EXPORT_SYMBOL_GPL(__srcu_read_lock_lite);
Use cases where the smp_mb() penalty is noticeable will probably notice
the cost of the extra function call too.
Can the main part be in srcu.h as well, to make it truly "lite"?
Otherwise we'd have to rely on compilers doing LTO, which may or may not happen.
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (10 preceding siblings ...)
2024-09-03 16:33 ` [PATCH rcu 11/11] refscale: Add srcu_read_lock_lite() support using "srcu-lite" Paul E. McKenney
@ 2024-09-03 21:38 ` Kent Overstreet
2024-09-03 22:13 ` Paul E. McKenney
2024-09-03 22:08 ` Andrii Nakryiko
2024-09-04 16:52 ` [PATCH rcu [12/11] srcu: Allow inlining of __srcu_read_{,un}lock_lite() Paul E. McKenney
13 siblings, 1 reply; 23+ messages in thread
From: Kent Overstreet @ 2024-09-03 21:38 UTC (permalink / raw)
To: Paul E. McKenney
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, bpf
On Tue, Sep 03, 2024 at 09:32:51AM GMT, Paul E. McKenney wrote:
> Hello!
>
> This series provides light-weight readers for SRCU. This lightness
> is selected by the caller by using the new srcu_read_lock_lite() and
> srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> srcu_read_unlock() flavors. Although this passes significant rcutorture
> testing, this should still be considered to be experimental.
This avoids memory barriers, correct?
> There are a few restrictions: (1) If srcu_read_lock_lite() is called
> on a given srcu_struct structure, then no other flavor may be used on
> that srcu_struct structure, before, during, or after. (2) The _lite()
> readers may only be invoked from regions of code where RCU is watching
> (as in those regions in which rcu_is_watching() returns true). (3)
> There is no auto-expediting for srcu_struct structures that have
> been passed to _lite() readers. (4) SRCU grace periods for _lite()
> srcu_struct structures invoke synchronize_rcu() at least twice, thus
> having longer latencies than their non-_lite() counterparts. (5) Even
> with synchronize_srcu_expedited(), the resulting SRCU grace period
> will invoke synchronize_rcu() at least twice, as opposed to invoking
> the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> from NMI handlers (that is what the _nmisafe() interface are for).
> Although one could imagine readers that were both _lite() and _nmisafe(),
> one might also imagine that the read-modify-write atomic operations that
> are needed by any NMI-safe SRCU read marker would make this unhelpful
> from a performance perspective.
So if I'm following, this should work fine for bcachefs, and be a nifty
small performance boost.
Can I give you an account for my test cluster? If you'd like, we can
convert bcachefs to it and point it at your branch.
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (11 preceding siblings ...)
2024-09-03 21:38 ` [PATCH rcu 0/11] Add light-weight readers for SRCU Kent Overstreet
@ 2024-09-03 22:08 ` Andrii Nakryiko
2024-09-03 22:20 ` Paul E. McKenney
2024-10-02 19:59 ` Andrii Nakryiko
2024-09-04 16:52 ` [PATCH rcu [12/11] srcu: Allow inlining of __srcu_read_{,un}lock_lite() Paul E. McKenney
13 siblings, 2 replies; 23+ messages in thread
From: Andrii Nakryiko @ 2024-09-03 22:08 UTC (permalink / raw)
To: paulmck
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Hello!
>
> This series provides light-weight readers for SRCU. This lightness
> is selected by the caller by using the new srcu_read_lock_lite() and
> srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> srcu_read_unlock() flavors. Although this passes significant rcutorture
> testing, this should still be considered to be experimental.
>
> There are a few restrictions: (1) If srcu_read_lock_lite() is called
> on a given srcu_struct structure, then no other flavor may be used on
> that srcu_struct structure, before, during, or after. (2) The _lite()
> readers may only be invoked from regions of code where RCU is watching
> (as in those regions in which rcu_is_watching() returns true). (3)
> There is no auto-expediting for srcu_struct structures that have
> been passed to _lite() readers. (4) SRCU grace periods for _lite()
> srcu_struct structures invoke synchronize_rcu() at least twice, thus
> having longer latencies than their non-_lite() counterparts. (5) Even
> with synchronize_srcu_expedited(), the resulting SRCU grace period
> will invoke synchronize_rcu() at least twice, as opposed to invoking
> the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> from NMI handlers (that is what the _nmisafe() interface are for).
> Although one could imagine readers that were both _lite() and _nmisafe(),
> one might also imagine that the read-modify-write atomic operations that
> are needed by any NMI-safe SRCU read marker would make this unhelpful
> from a performance perspective.
>
> All that said, the patches in this series are as follows:
>
> 1. Rename srcu_might_be_idle() to srcu_should_expedite().
>
> 2. Introduce srcu_gp_is_expedited() helper function.
>
> 3. Renaming in preparation for additional reader flavor.
>
> 4. Bit manipulation changes for additional reader flavor.
>
> 5. Standardize srcu_data pointers to "sdp" and similar.
>
> 6. Convert srcu_data ->srcu_reader_flavor to bit field.
>
> 7. Add srcu_read_lock_lite() and srcu_read_unlock_lite().
>
> 8. rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
>
> 9. rcutorture: Add reader_flavor parameter for SRCU readers.
>
> 10. rcutorture: Add srcu_read_lock_lite() support to
> rcutorture.reader_flavor.
>
> 11. refscale: Add srcu_read_lock_lite() support using "srcu-lite".
>
> Thanx, Paul
>
Thanks Paul for working on this!
I applied your patches on top of all my uprobe changes (including the
RFC patches that remove locks, optimize VMA to inode resolution, etc,
etc; basically the fastest uprobe/uretprobe state I can get to). And
then tested a few changes:
- A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
for uretprobes)
- B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
- C) B + RCU Tasks Trace converted to SRCU-lite
- D) I also pessimized baseline by reverting RCU Tasks Trace, so
both uprobes and uretprobes are SRCU protected. This allowed me to see
a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
performance out of the equation.
For uprobes I used basically two benchmarks. The first, uprobe-nop,
benchmarks entry uprobes (which are the fastest, most optimized case,
using RCU Tasks Trace in A and SRCU in D). The second, uretprobe-nop,
benchmarks return uprobes (uretprobes), which use normal SRCU in both
A) and D). The uretprobe-nop benchmark basically combines entry and
return probe overheads, because that's how uretprobes work.
So, below are the most meaningful comparisons. First, SRCU vs
SRCU-lite for uretprobes:
BASELINE (A)
============
uretprobe-nop ( 1 cpus): 1.941 ± 0.002M/s ( 1.941M/s/cpu)
uretprobe-nop ( 2 cpus): 3.731 ± 0.001M/s ( 1.866M/s/cpu)
uretprobe-nop ( 3 cpus): 5.492 ± 0.002M/s ( 1.831M/s/cpu)
uretprobe-nop ( 4 cpus): 7.234 ± 0.003M/s ( 1.808M/s/cpu)
uretprobe-nop ( 8 cpus): 13.448 ± 0.098M/s ( 1.681M/s/cpu)
uretprobe-nop (16 cpus): 22.905 ± 0.009M/s ( 1.432M/s/cpu)
uretprobe-nop (32 cpus): 44.760 ± 0.069M/s ( 1.399M/s/cpu)
uretprobe-nop (40 cpus): 52.986 ± 0.104M/s ( 1.325M/s/cpu)
uretprobe-nop (64 cpus): 43.650 ± 0.435M/s ( 0.682M/s/cpu)
uretprobe-nop (80 cpus): 46.831 ± 0.938M/s ( 0.585M/s/cpu)
SRCU-lite for uretprobe (B)
===========================
uretprobe-nop ( 1 cpus): 2.014 ± 0.014M/s ( 2.014M/s/cpu)
uretprobe-nop ( 2 cpus): 3.820 ± 0.002M/s ( 1.910M/s/cpu)
uretprobe-nop ( 3 cpus): 5.640 ± 0.003M/s ( 1.880M/s/cpu)
uretprobe-nop ( 4 cpus): 7.410 ± 0.003M/s ( 1.852M/s/cpu)
uretprobe-nop ( 8 cpus): 13.877 ± 0.009M/s ( 1.735M/s/cpu)
uretprobe-nop (16 cpus): 23.372 ± 0.022M/s ( 1.461M/s/cpu)
uretprobe-nop (32 cpus): 45.748 ± 0.048M/s ( 1.430M/s/cpu)
uretprobe-nop (40 cpus): 54.327 ± 0.093M/s ( 1.358M/s/cpu)
uretprobe-nop (64 cpus): 43.672 ± 0.371M/s ( 0.682M/s/cpu)
uretprobe-nop (80 cpus): 47.470 ± 0.753M/s ( 0.593M/s/cpu)
You can see that across the board (except for the noisy 64-CPU case)
SRCU-lite is faster.
Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
vs SRCU-lite for uprobes.
BASELINE (A)
============
uprobe-nop ( 1 cpus): 3.574 ± 0.004M/s ( 3.574M/s/cpu)
uprobe-nop ( 2 cpus): 6.735 ± 0.006M/s ( 3.368M/s/cpu)
uprobe-nop ( 3 cpus): 10.102 ± 0.005M/s ( 3.367M/s/cpu)
uprobe-nop ( 4 cpus): 13.087 ± 0.008M/s ( 3.272M/s/cpu)
uprobe-nop ( 8 cpus): 24.622 ± 0.031M/s ( 3.078M/s/cpu)
uprobe-nop (16 cpus): 41.752 ± 0.020M/s ( 2.610M/s/cpu)
uprobe-nop (32 cpus): 84.973 ± 0.115M/s ( 2.655M/s/cpu)
uprobe-nop (40 cpus): 102.229 ± 0.030M/s ( 2.556M/s/cpu)
uprobe-nop (64 cpus): 125.537 ± 0.045M/s ( 1.962M/s/cpu)
uprobe-nop (80 cpus): 143.091 ± 0.044M/s ( 1.789M/s/cpu)
SRCU-lite for uprobes (C)
=========================
uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
so consider this just a confirmation. I'm not sure I'd like to switch
from RCU Tasks Trace to SRCU-lite for the uprobes part, but at least we
have numbers to make that decision.
Finally, here is SRCU vs SRCU-lite for entry uprobes (i.e., as if we
never had RCU Tasks Trace). I've included a somewhat more extensive
set of CPU counts for completeness.
BASELINE w/ SRCU for uprobes (D)
================================
uprobe-nop ( 1 cpus): 3.413 ± 0.003M/s ( 3.413M/s/cpu)
uprobe-nop ( 2 cpus): 6.305 ± 0.003M/s ( 3.153M/s/cpu)
uprobe-nop ( 3 cpus): 9.442 ± 0.018M/s ( 3.147M/s/cpu)
uprobe-nop ( 4 cpus): 12.253 ± 0.006M/s ( 3.063M/s/cpu)
uprobe-nop ( 5 cpus): 15.316 ± 0.007M/s ( 3.063M/s/cpu)
uprobe-nop ( 6 cpus): 18.287 ± 0.030M/s ( 3.048M/s/cpu)
uprobe-nop ( 7 cpus): 21.378 ± 0.025M/s ( 3.054M/s/cpu)
uprobe-nop ( 8 cpus): 23.044 ± 0.010M/s ( 2.881M/s/cpu)
uprobe-nop (10 cpus): 28.778 ± 0.012M/s ( 2.878M/s/cpu)
uprobe-nop (12 cpus): 31.300 ± 0.016M/s ( 2.608M/s/cpu)
uprobe-nop (14 cpus): 36.580 ± 0.007M/s ( 2.613M/s/cpu)
uprobe-nop (16 cpus): 38.848 ± 0.017M/s ( 2.428M/s/cpu)
uprobe-nop (24 cpus): 60.298 ± 0.080M/s ( 2.512M/s/cpu)
uprobe-nop (32 cpus): 77.137 ± 1.957M/s ( 2.411M/s/cpu)
uprobe-nop (40 cpus): 89.205 ± 1.278M/s ( 2.230M/s/cpu)
uprobe-nop (48 cpus): 99.207 ± 0.444M/s ( 2.067M/s/cpu)
uprobe-nop (56 cpus): 102.399 ± 0.484M/s ( 1.829M/s/cpu)
uprobe-nop (64 cpus): 115.390 ± 0.972M/s ( 1.803M/s/cpu)
uprobe-nop (72 cpus): 127.476 ± 0.050M/s ( 1.770M/s/cpu)
uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
SRCU-lite for uprobes (C)
=========================
uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
uprobe-nop ( 5 cpus): 15.634 ± 0.008M/s ( 3.127M/s/cpu)
uprobe-nop ( 6 cpus): 18.443 ± 0.018M/s ( 3.074M/s/cpu)
uprobe-nop ( 7 cpus): 21.793 ± 0.057M/s ( 3.113M/s/cpu)
uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
uprobe-nop (10 cpus): 29.430 ± 0.021M/s ( 2.943M/s/cpu)
uprobe-nop (12 cpus): 32.035 ± 0.008M/s ( 2.670M/s/cpu)
uprobe-nop (14 cpus): 37.174 ± 0.046M/s ( 2.655M/s/cpu)
uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
uprobe-nop (24 cpus): 61.656 ± 0.187M/s ( 2.569M/s/cpu)
uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
uprobe-nop (48 cpus): 104.178 ± 0.033M/s ( 2.170M/s/cpu)
uprobe-nop (56 cpus): 105.689 ± 0.703M/s ( 1.887M/s/cpu)
uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
uprobe-nop (72 cpus): 127.574 ± 0.033M/s ( 1.772M/s/cpu)
uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
So, say, at 32 threads, we get 79.6 vs 77.1, which is about a 3%
throughput win. That is not negligible!
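For anyone who wants to redo this arithmetic for other rows, here is a
minimal sketch in C. The helper names throughput_gain_pct() and
report_gain() are made up for illustration and are not part of any
benchmark tool; the only inputs are the M/s figures from the tables
above.

```c
#include <stdio.h>

/*
 * Relative throughput change, in percent, when going from "base" to
 * "candidate". Positive means the candidate is faster.
 * (Hypothetical helper name, used here for illustration only.)
 */
double throughput_gain_pct(double base, double candidate)
{
	return (candidate - base) / base * 100.0;
}

/* Print the gain for one pair of measurements (both in M/s). */
void report_gain(const char *label, double base, double candidate)
{
	printf("%s: %.3f -> %.3f M/s (%+.1f%%)\n",
	       label, base, candidate,
	       throughput_gain_pct(base, candidate));
}
```

Calling report_gain("uprobe-nop, 32 cpus", 77.137, 79.616) with the
32-CPU rows from tables D and C gives a gain of a bit over 3%, matching
the estimate above.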
Note that as we get to 80 cores the data gets noisier (hyperthreading,
background system noise, etc.). But you can still see an improvement
across basically the entire range.
Hopefully the above data is useful.
> ------------------------------------------------------------------------
>
> Documentation/admin-guide/kernel-parameters.txt | 4
> b/Documentation/admin-guide/kernel-parameters.txt | 8 +
> b/include/linux/srcu.h | 21 +-
> b/include/linux/srcutree.h | 2
> b/kernel/rcu/rcutorture.c | 28 +--
> b/kernel/rcu/refscale.c | 54 +++++--
> b/kernel/rcu/srcutree.c | 16 +-
> include/linux/srcu.h | 86 +++++++++--
> include/linux/srcutree.h | 5
> kernel/rcu/rcutorture.c | 37 +++-
> kernel/rcu/srcutree.c | 168 +++++++++++++++-------
> 11 files changed, 308 insertions(+), 121 deletions(-)
* Re: [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()
2024-09-03 19:45 ` Alexei Starovoitov
@ 2024-09-03 22:10 ` Paul E. McKenney
0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 22:10 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: rcu, LKML, Kernel Team, Steven Rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Tue, Sep 03, 2024 at 12:45:23PM -0700, Alexei Starovoitov wrote:
> On Tue, Sep 3, 2024 at 9:33 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index 84daaa33ea0ab..4ba96e2cfa405 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> ...
>
> > +static inline int srcu_read_lock_lite(struct srcu_struct *ssp) __acquires(ssp)
> > +{
> > + int retval;
> > +
> > + srcu_check_read_flavor(ssp, SRCU_READ_FLAVOR_LITE);
> > + retval = __srcu_read_lock_lite(ssp);
> > + rcu_try_lock_acquire(&ssp->dep_map);
> > + return retval;
> > +}
>
> ...
>
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 602b4b8c4b891..bab888e86b9bb 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > +int __srcu_read_lock_lite(struct srcu_struct *ssp)
> > +{
> > + int idx;
> > +
> > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
> > + idx = READ_ONCE(ssp->srcu_idx) & 0x1;
> > + this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); /* Y */
> > + barrier(); /* Avoid leaking the critical section. */
> > + return idx;
> > +}
> > +EXPORT_SYMBOL_GPL(__srcu_read_lock_lite);
>
> The use cases where smp_mb() penalty is noticeable probably will notice
> the cost of extra call too.
> Can the main part be in srcu.h as well to make it truly "lite" ?
> Otherwise we'd have to rely on compilers doing LTO which may or may not happen.
I vaguely recall #include issues for old-times srcu_read_lock() and
srcu_read_unlock(), but I will try it and see what happens.
In the meantime, if you are too curious for your own good... ;-)
One way to check the performance is to work in the other direction. For
example, add ndelay(10) or similar to the current srcu_read_lock_lite()
implementation and see whether this is visible at the system level.
Thanx, Paul
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 21:38 ` [PATCH rcu 0/11] Add light-weight readers for SRCU Kent Overstreet
@ 2024-09-03 22:13 ` Paul E. McKenney
2024-09-03 23:03 ` Kent Overstreet
0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 22:13 UTC (permalink / raw)
To: Kent Overstreet
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, bpf
On Tue, Sep 03, 2024 at 05:38:05PM -0400, Kent Overstreet wrote:
> On Tue, Sep 03, 2024 at 09:32:51AM GMT, Paul E. McKenney wrote:
> > Hello!
> >
> > This series provides light-weight readers for SRCU. This lightness
> > is selected by the caller by using the new srcu_read_lock_lite() and
> > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > testing, this should still be considered to be experimental.
>
> This avoids memory barriers, correct?
Yes, there are no smp_mb() invocations in either srcu_read_lock_lite()
or srcu_read_unlock_lite(). As usual, nothing comes for free, so the
overhead is moved to the update side, and amplified, in the guise of
the at least two calls to synchronize_rcu().
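To make the reader-side cost concrete, here is a rough userspace toy
model of the fast path, loosely following the __srcu_read_lock_lite()
hunk quoted earlier in this thread. It is only a sketch under several
assumptions: all toy_* names are invented, a single plain struct stands
in for the per-CPU counters, the unlock path is assumed to mirror the
lock path (this thread only quotes the lock side), and the update side
with its synchronize_rcu() calls is left out entirely.

```c
#include <stdatomic.h>

/* Toy, single-threaded stand-in for one CPU's share of an srcu_struct. */
struct toy_srcu {
	unsigned int idx;               /* current index, flipped by updaters */
	unsigned long lock_count[2];    /* readers that entered, per index */
	unsigned long unlock_count[2];  /* readers that exited, per index */
};

/* Reader entry: pick the index, bump a counter, compiler barrier only. */
int toy_read_lock_lite(struct toy_srcu *sp)
{
	int idx = sp->idx & 0x1;

	sp->lock_count[idx]++;
	atomic_signal_fence(memory_order_seq_cst); /* no hardware fence */
	return idx;
}

/* Reader exit: assumed to be the mirror image of the entry path. */
void toy_read_unlock_lite(struct toy_srcu *sp, int idx)
{
	atomic_signal_fence(memory_order_seq_cst); /* no hardware fence */
	sp->unlock_count[idx]++;
}

/* Number of readers still inside a critical section for this index. */
unsigned long toy_readers_active(const struct toy_srcu *sp, int idx)
{
	return sp->lock_count[idx] - sp->unlock_count[idx];
}
```

The point of the model is that neither reader function contains a full
memory barrier, only a counter update and a compiler-only fence. In the
real implementation the updater then has to do the extra work, which is
where the synchronize_rcu() calls come in.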
> > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > on a given srcu_struct structure, then no other flavor may be used on
> > that srcu_struct structure, before, during, or after. (2) The _lite()
> > readers may only be invoked from regions of code where RCU is watching
> > (as in those regions in which rcu_is_watching() returns true). (3)
> > There is no auto-expediting for srcu_struct structures that have
> > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > having longer latencies than their non-_lite() counterparts. (5) Even
> > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > from NMI handlers (that is what the _nmisafe() interface are for).
> > Although one could imagine readers that were both _lite() and _nmisafe(),
> > one might also imagine that the read-modify-write atomic operations that
> > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > from a performance perspective.
>
> So if I'm following, this should work fine for bcachefs, and be a nifty
> small performance boost.
Glad you like it!
> Can I give you an account for my test cluster? If you'd like, we can
> convert bcachefs to it and point it at your branch.
Thank you, but I will pass on more accounts. I have a fair amount of
hardware at my disposal. ;-)
Thanx, Paul
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 22:08 ` Andrii Nakryiko
@ 2024-09-03 22:20 ` Paul E. McKenney
2024-10-02 19:59 ` Andrii Nakryiko
1 sibling, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-03 22:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Tue, Sep 03, 2024 at 03:08:21PM -0700, Andrii Nakryiko wrote:
> On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Hello!
> >
> > This series provides light-weight readers for SRCU. This lightness
> > is selected by the caller by using the new srcu_read_lock_lite() and
> > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > testing, this should still be considered to be experimental.
> >
> > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > on a given srcu_struct structure, then no other flavor may be used on
> > that srcu_struct structure, before, during, or after. (2) The _lite()
> > readers may only be invoked from regions of code where RCU is watching
> > (as in those regions in which rcu_is_watching() returns true). (3)
> > There is no auto-expediting for srcu_struct structures that have
> > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > having longer latencies than their non-_lite() counterparts. (5) Even
> > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > from NMI handlers (that is what the _nmisafe() interface are for).
> > Although one could imagine readers that were both _lite() and _nmisafe(),
> > one might also imagine that the read-modify-write atomic operations that
> > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > from a performance perspective.
> >
> > All that said, the patches in this series are as follows:
> >
> > 1. Rename srcu_might_be_idle() to srcu_should_expedite().
> >
> > 2. Introduce srcu_gp_is_expedited() helper function.
> >
> > 3. Renaming in preparation for additional reader flavor.
> >
> > 4. Bit manipulation changes for additional reader flavor.
> >
> > 5. Standardize srcu_data pointers to "sdp" and similar.
> >
> > 6. Convert srcu_data ->srcu_reader_flavor to bit field.
> >
> > 7. Add srcu_read_lock_lite() and srcu_read_unlock_lite().
> >
> > 8. rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
> >
> > 9. rcutorture: Add reader_flavor parameter for SRCU readers.
> >
> > 10. rcutorture: Add srcu_read_lock_lite() support to
> > rcutorture.reader_flavor.
> >
> > 11. refscale: Add srcu_read_lock_lite() support using "srcu-lite".
> >
> > Thanx, Paul
> >
>
> Thanks Paul for working on this!
>
> I applied your patches on top of all my uprobe changes (including the
> RFC patches that remove locks, optimize VMA to inode resolution, etc,
> etc; basically the fastest uprobe/uretprobe state I can get to). And
> then tested a few changes:
>
> - A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
> for uretprobes)
> - B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
> - C) B + RCU Tasks Trace converted to SRCU-lite
> - D) I also pessimized baseline by reverting RCU Tasks Trace, so
> both uprobes and uretprobes are SRCU protected. This allowed me to see
> a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
> performance out of the equation.
>
> In uprobes I used basically two benchmarks. One, uprobe-nop, that
> benchmarks entry uprobes (which are the fastest most optimized case,
> using RCU Tasks Trace in A and SRCU in D), and another that benchmarks
> return uprobes (uretprobes), called uretprobe-nop, which is normal
> SRCU both in A) and D). The latter uretprobe-nop benchmark basically
> combines entry and return probe overheads, because that's how
> uretprobes work.
>
> So, below are the most meaningful comparisons. First, SRCU vs
> SRCU-lite for uretprobes:
>
> BASELINE (A)
> ============
> uretprobe-nop ( 1 cpus): 1.941 ± 0.002M/s ( 1.941M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.731 ± 0.001M/s ( 1.866M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.492 ± 0.002M/s ( 1.831M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.234 ± 0.003M/s ( 1.808M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.448 ± 0.098M/s ( 1.681M/s/cpu)
> uretprobe-nop (16 cpus): 22.905 ± 0.009M/s ( 1.432M/s/cpu)
> uretprobe-nop (32 cpus): 44.760 ± 0.069M/s ( 1.399M/s/cpu)
> uretprobe-nop (40 cpus): 52.986 ± 0.104M/s ( 1.325M/s/cpu)
> uretprobe-nop (64 cpus): 43.650 ± 0.435M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 46.831 ± 0.938M/s ( 0.585M/s/cpu)
>
> SRCU-lite for uretprobe (B)
> ===========================
> uretprobe-nop ( 1 cpus): 2.014 ± 0.014M/s ( 2.014M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.820 ± 0.002M/s ( 1.910M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.640 ± 0.003M/s ( 1.880M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.410 ± 0.003M/s ( 1.852M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.877 ± 0.009M/s ( 1.735M/s/cpu)
> uretprobe-nop (16 cpus): 23.372 ± 0.022M/s ( 1.461M/s/cpu)
> uretprobe-nop (32 cpus): 45.748 ± 0.048M/s ( 1.430M/s/cpu)
> uretprobe-nop (40 cpus): 54.327 ± 0.093M/s ( 1.358M/s/cpu)
> uretprobe-nop (64 cpus): 43.672 ± 0.371M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 47.470 ± 0.753M/s ( 0.593M/s/cpu)
>
> You can see that across the board (except for noisy 64 CPU case)
> SRCU-lite is faster.
>
>
> Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
> vs SRCU-lite for uprobes.
>
> BASELINE (A)
> ============
> uprobe-nop ( 1 cpus): 3.574 ± 0.004M/s ( 3.574M/s/cpu)
> uprobe-nop ( 2 cpus): 6.735 ± 0.006M/s ( 3.368M/s/cpu)
> uprobe-nop ( 3 cpus): 10.102 ± 0.005M/s ( 3.367M/s/cpu)
> uprobe-nop ( 4 cpus): 13.087 ± 0.008M/s ( 3.272M/s/cpu)
> uprobe-nop ( 8 cpus): 24.622 ± 0.031M/s ( 3.078M/s/cpu)
> uprobe-nop (16 cpus): 41.752 ± 0.020M/s ( 2.610M/s/cpu)
> uprobe-nop (32 cpus): 84.973 ± 0.115M/s ( 2.655M/s/cpu)
> uprobe-nop (40 cpus): 102.229 ± 0.030M/s ( 2.556M/s/cpu)
> uprobe-nop (64 cpus): 125.537 ± 0.045M/s ( 1.962M/s/cpu)
> uprobe-nop (80 cpus): 143.091 ± 0.044M/s ( 1.789M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>
>
> Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
> so consider this just a confirmation. I'm not sure I'd like to switch
> from RCU Tasks Trace to SRCU-lite for uprobes part, but at least we
> have numbers to make that decision.
>
> Finally, to see SRCU vs SRCU-lite for entry uprobes improvements
> (i.e., if we never had RCU Tasks Trace). I've included a bit more
> extensive set of CPU counts for completeness.
>
> BASELINE w/ SRCU for uprobes (D)
> ================================
> uprobe-nop ( 1 cpus): 3.413 ± 0.003M/s ( 3.413M/s/cpu)
> uprobe-nop ( 2 cpus): 6.305 ± 0.003M/s ( 3.153M/s/cpu)
> uprobe-nop ( 3 cpus): 9.442 ± 0.018M/s ( 3.147M/s/cpu)
> uprobe-nop ( 4 cpus): 12.253 ± 0.006M/s ( 3.063M/s/cpu)
> uprobe-nop ( 5 cpus): 15.316 ± 0.007M/s ( 3.063M/s/cpu)
> uprobe-nop ( 6 cpus): 18.287 ± 0.030M/s ( 3.048M/s/cpu)
> uprobe-nop ( 7 cpus): 21.378 ± 0.025M/s ( 3.054M/s/cpu)
> uprobe-nop ( 8 cpus): 23.044 ± 0.010M/s ( 2.881M/s/cpu)
> uprobe-nop (10 cpus): 28.778 ± 0.012M/s ( 2.878M/s/cpu)
> uprobe-nop (12 cpus): 31.300 ± 0.016M/s ( 2.608M/s/cpu)
> uprobe-nop (14 cpus): 36.580 ± 0.007M/s ( 2.613M/s/cpu)
> uprobe-nop (16 cpus): 38.848 ± 0.017M/s ( 2.428M/s/cpu)
> uprobe-nop (24 cpus): 60.298 ± 0.080M/s ( 2.512M/s/cpu)
> uprobe-nop (32 cpus): 77.137 ± 1.957M/s ( 2.411M/s/cpu)
> uprobe-nop (40 cpus): 89.205 ± 1.278M/s ( 2.230M/s/cpu)
> uprobe-nop (48 cpus): 99.207 ± 0.444M/s ( 2.067M/s/cpu)
> uprobe-nop (56 cpus): 102.399 ± 0.484M/s ( 1.829M/s/cpu)
> uprobe-nop (64 cpus): 115.390 ± 0.972M/s ( 1.803M/s/cpu)
> uprobe-nop (72 cpus): 127.476 ± 0.050M/s ( 1.770M/s/cpu)
> uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 5 cpus): 15.634 ± 0.008M/s ( 3.127M/s/cpu)
> uprobe-nop ( 6 cpus): 18.443 ± 0.018M/s ( 3.074M/s/cpu)
> uprobe-nop ( 7 cpus): 21.793 ± 0.057M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (10 cpus): 29.430 ± 0.021M/s ( 2.943M/s/cpu)
> uprobe-nop (12 cpus): 32.035 ± 0.008M/s ( 2.670M/s/cpu)
> uprobe-nop (14 cpus): 37.174 ± 0.046M/s ( 2.655M/s/cpu)
> uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (24 cpus): 61.656 ± 0.187M/s ( 2.569M/s/cpu)
> uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (48 cpus): 104.178 ± 0.033M/s ( 2.170M/s/cpu)
> uprobe-nop (56 cpus): 105.689 ± 0.703M/s ( 1.887M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (72 cpus): 127.574 ± 0.033M/s ( 1.772M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>
> So, say, at 32 threads, we get 79.6 vs 77.1, which is about 3%
> throughput win. Which is not negligible!
>
> Note that as we get to 80 cores data is more noisy (hyperthreading,
> background system noise, etc). But you can still see an improvement
> across basically the entire range.
>
> Hopefully the above data is useful.
Thank you very much for running this, Andrii!
And I agree that it is no surprise that Tasks Trace RCU is faster than
SRCU-lite, but I must confess that I never did expect SRCU's array
accesses to be free. And the ability to create multiple independent
instances of SRCU-lite might compensate in at least some cases.
Thanx, Paul
> > ------------------------------------------------------------------------
> >
> > Documentation/admin-guide/kernel-parameters.txt | 4
> > b/Documentation/admin-guide/kernel-parameters.txt | 8 +
> > b/include/linux/srcu.h | 21 +-
> > b/include/linux/srcutree.h | 2
> > b/kernel/rcu/rcutorture.c | 28 +--
> > b/kernel/rcu/refscale.c | 54 +++++--
> > b/kernel/rcu/srcutree.c | 16 +-
> > include/linux/srcu.h | 86 +++++++++--
> > include/linux/srcutree.h | 5
> > kernel/rcu/rcutorture.c | 37 +++-
> > kernel/rcu/srcutree.c | 168 +++++++++++++++-------
> > 11 files changed, 308 insertions(+), 121 deletions(-)
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 22:13 ` Paul E. McKenney
@ 2024-09-03 23:03 ` Kent Overstreet
2024-09-04 10:54 ` Paul E. McKenney
0 siblings, 1 reply; 23+ messages in thread
From: Kent Overstreet @ 2024-09-03 23:03 UTC (permalink / raw)
To: Paul E. McKenney
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, bpf
On Tue, Sep 03, 2024 at 03:13:40PM GMT, Paul E. McKenney wrote:
> On Tue, Sep 03, 2024 at 05:38:05PM -0400, Kent Overstreet wrote:
> > On Tue, Sep 03, 2024 at 09:32:51AM GMT, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > This series provides light-weight readers for SRCU. This lightness
> > > is selected by the caller by using the new srcu_read_lock_lite() and
> > > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > > testing, this should still be considered to be experimental.
> >
> > This avoids memory barriers, correct?
>
> Yes, there are no smp_mb() invocations in either srcu_read_lock_lite()
> or srcu_read_unlock_lite(). As usual, nothing comes for free, so the
> overhead is moved to the update side, and amplified, in the guise of
> the at least two calls to synchronize_rcu().
>
> > > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > > on a given srcu_struct structure, then no other flavor may be used on
> > > that srcu_struct structure, before, during, or after. (2) The _lite()
> > > readers may only be invoked from regions of code where RCU is watching
> > > (as in those regions in which rcu_is_watching() returns true). (3)
> > > There is no auto-expediting for srcu_struct structures that have
> > > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > > having longer latencies than their non-_lite() counterparts. (5) Even
> > > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > > from NMI handlers (that is what the _nmisafe() interface are for).
> > > Although one could imagine readers that were both _lite() and _nmisafe(),
> > > one might also imagine that the read-modify-write atomic operations that
> > > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > > from a performance perspective.
> >
> > So if I'm following, this should work fine for bcachefs, and be a nifty
> > small performance boost.
>
> Glad you like it!
>
> > Can I give you an account for my test cluster? If you'd like, we can
> > convert bcachefs to it and point it at your branch.
>
> Thank you, but I will pass on more accounts. I have a fair amount of
> hardware at my disposal. ;-)
Well - bcachefs might be a good torture test; if you patch bcachefs to
use the new API, point me at a branch and I'll point the CI at it.
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 23:03 ` Kent Overstreet
@ 2024-09-04 10:54 ` Paul E. McKenney
0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-04 10:54 UTC (permalink / raw)
To: Kent Overstreet
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, bpf
On Tue, Sep 03, 2024 at 07:03:53PM -0400, Kent Overstreet wrote:
> On Tue, Sep 03, 2024 at 03:13:40PM GMT, Paul E. McKenney wrote:
> > On Tue, Sep 03, 2024 at 05:38:05PM -0400, Kent Overstreet wrote:
> > > On Tue, Sep 03, 2024 at 09:32:51AM GMT, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > This series provides light-weight readers for SRCU. This lightness
> > > > is selected by the caller by using the new srcu_read_lock_lite() and
> > > > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > > > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > > > testing, this should still be considered to be experimental.
> > >
> > > This avoids memory barriers, correct?
> >
> > Yes, there are no smp_mb() invocations in either srcu_read_lock_lite()
> > or srcu_read_unlock_lite(). As usual, nothing comes for free, so the
> > overhead is moved to the update side, and amplified, in the guise of
> > the at least two calls to synchronize_rcu().
> >
> > > > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > > > on a given srcu_struct structure, then no other flavor may be used on
> > > > that srcu_struct structure, before, during, or after. (2) The _lite()
> > > > readers may only be invoked from regions of code where RCU is watching
> > > > (as in those regions in which rcu_is_watching() returns true). (3)
> > > > There is no auto-expediting for srcu_struct structures that have
> > > > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > > > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > > > having longer latencies than their non-_lite() counterparts. (5) Even
> > > > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > > > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > > > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > > > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > > > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > > > from NMI handlers (that is what the _nmisafe() interfaces are for).
> > > > Although one could imagine readers that were both _lite() and _nmisafe(),
> > > > one might also imagine that the read-modify-write atomic operations that
> > > > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > > > from a performance perspective.
> > >
> > > So if I'm following, this should work fine for bcachefs, and be a nifty
> > > > small performance boost.
> >
> > Glad you like it!
> >
> > > Can I give you an account for my test cluster? If you'd like, we can
> > > convert bcachefs to it and point it at your branch.
> >
> > Thank you, but I will pass on more accounts. I have a fair amount of
> > hardware at my disposal. ;-)
>
> Well - bcachefs might be a good torture test; if you patch bcachefs to
> use the new API point me at a branch and I'll point the CI at it

I am sorry, but I am getting on a plane today, am short of time, and
won't be responsive for the inevitable issues.

But this is a very simple change: just convert srcu_read_lock() and
srcu_read_unlock() into srcu_read_lock_lite() and srcu_read_unlock_lite().
Would one of your bcachefs early adopters be up for this task?

Thanx, Paul
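The reader-side conversion described above is mechanical. As a rough
illustration only, the following userspace C sketch models the counting
that the _lite() readers perform. It is not kernel code: every model_*
name, MODEL_NR_CPUS, and the explicit "cpu" argument are assumptions made
for this sketch (the kernel uses this_cpu_inc() on per-CPU data and takes
no CPU parameter), and this single-threaded model ignores the ordering
that the update side obtains from its synchronize_rcu() calls.

```c
/*
 * Userspace model of SRCU-lite reader counting.  NOT kernel code:
 * every model_* name is hypothetical and exists only for illustration.
 */
#include <assert.h>

#define MODEL_NR_CPUS 4

struct model_srcu {
	int idx;					/* stands in for ssp->srcu_idx */
	unsigned long lock_count[MODEL_NR_CPUS][2];	/* stands in for srcu_lock_count[] */
	unsigned long unlock_count[MODEL_NR_CPUS][2];	/* stands in for srcu_unlock_count[] */
};

/* Count a new reader on "cpu"; the returned index goes to the matching unlock. */
static int model_read_lock_lite(struct model_srcu *sp, int cpu)
{
	int idx = sp->idx & 0x1;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	sp->lock_count[cpu][idx]++;
	return idx;
}

/* Count the reader out.  "cpu" may differ from the CPU used at lock time. */
static void model_read_unlock_lite(struct model_srcu *sp, int cpu, int idx)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	sp->unlock_count[cpu][idx]++;
}

/* Readers still inside a critical section for "idx": summed locks minus summed unlocks. */
static unsigned long model_readers_active(const struct model_srcu *sp, int idx)
{
	unsigned long locks = 0, unlocks = 0;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		locks += sp->lock_count[cpu][idx];
		unlocks += sp->unlock_count[cpu][idx];
	}
	return locks - unlocks;
}
```

In this model a reader takes the index from model_read_lock_lite(), runs
its critical section, and passes the same index to
model_read_unlock_lite(). The counters for a given index may be
unbalanced on any single CPU (for example after a migration), yet the
sums across CPUs match again once the reader has finished.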
* [PATCH rcu 12/11] srcu: Allow inlining of __srcu_read_{,un}lock_lite()
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
` (12 preceding siblings ...)
2024-09-03 22:08 ` Andrii Nakryiko
@ 2024-09-04 16:52 ` Paul E. McKenney
13 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-09-04 16:52 UTC (permalink / raw)
To: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
srcu: Allow inlining of __srcu_read_{,un}lock_lite()

This commit moves __srcu_read_lock_lite() and __srcu_read_unlock_lite()
into include/linux/srcutree.h and marks them "static inline" so that they
can be inlined into srcu_read_lock_lite() and srcu_read_unlock_lite(),
respectively. They are not hand-inlined due to Tree SRCU and Tiny SRCU
having different implementations.

Please note that this is early days for this series, and especially for
this particular patch. Moving these functions from the seclusion of
kernel/rcu/srcutree.c to the more widely exposed include/linux/srcutree.h
might have #include consequences on some as-yet-unsuspected architecture
or combination of Kconfig options.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 8074138cbd624..778eb61542e18 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -209,4 +209,43 @@ void synchronize_srcu_expedited(struct srcu_struct *ssp);
void srcu_barrier(struct srcu_struct *ssp);
void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf);
+/*
+ * Counts the new reader in the appropriate per-CPU element of the
+ * srcu_struct. Returns an index that must be passed to the matching
+ * srcu_read_unlock_lite().
+ *
+ * Note that this_cpu_inc() is an RCU read-side critical section either
+ * because it disables interrupts, because it is a single instruction,
+ * or because it is a read-modify-write atomic operation, depending on
+ * the whims of the architecture.
+ */
+static inline int __srcu_read_lock_lite(struct srcu_struct *ssp)
+{
+	int idx;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
+	idx = READ_ONCE(ssp->srcu_idx) & 0x1;
+	this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); /* Y */
+	barrier(); /* Avoid leaking the critical section. */
+	return idx;
+}
+
+/*
+ * Removes the count for the old reader from the appropriate
+ * per-CPU element of the srcu_struct. Note that this may well be a
+ * different CPU than that which was incremented by the corresponding
+ * srcu_read_lock_lite(), but it must be within the same task.
+ *
+ * Note that this_cpu_inc() is an RCU read-side critical section either
+ * because it disables interrupts, because it is a single instruction,
+ * or because it is a read-modify-write atomic operation, depending on
+ * the whims of the architecture.
+ */
+static inline void __srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
+{
+	barrier(); /* Avoid leaking the critical section. */
+	this_cpu_inc(ssp->sda->srcu_unlock_count[idx].counter); /* Z */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_unlock_lite().");
+}
+
#endif
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index bab888e86b9bb..124ad43c189d6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -763,47 +763,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);
-/*
- * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct. Returns an index that must be passed to the matching
- * srcu_read_unlock_lite().
- *
- * Note that this_cpu_inc() is an RCU read-side critical section either
- * because it disables interrupts, because it is a single instruction,
- * or because it is a read-modify-write atomic operation, depending on
- * the whims of the architecture.
- */
-int __srcu_read_lock_lite(struct srcu_struct *ssp)
-{
-	int idx;
-
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_lock_lite().");
-	idx = READ_ONCE(ssp->srcu_idx) & 0x1;
-	this_cpu_inc(ssp->sda->srcu_lock_count[idx].counter); /* Y */
-	barrier(); /* Avoid leaking the critical section. */
-	return idx;
-}
-EXPORT_SYMBOL_GPL(__srcu_read_lock_lite);
-
-/*
- * Removes the count for the old reader from the appropriate
- * per-CPU element of the srcu_struct. Note that this may well be a
- * different CPU than that which was incremented by the corresponding
- * srcu_read_lock_lite(), but it must be within the same task.
- *
- * Note that this_cpu_inc() is an RCU read-side critical section either
- * because it disables interrupts, because it is a single instruction,
- * or because it is a read-modify-write atomic operation, depending on
- * the whims of the architecture.
- */
-void __srcu_read_unlock_lite(struct srcu_struct *ssp, int idx)
-{
-	barrier(); /* Avoid leaking the critical section. */
-	this_cpu_inc(ssp->sda->srcu_unlock_count[idx].counter); /* Z */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "RCU must be watching srcu_read_unlock_lite().");
-}
-EXPORT_SYMBOL_GPL(__srcu_read_unlock_lite);
-
#ifdef CONFIG_NEED_SRCU_NMI_SAFE
/*
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-09-03 22:08 ` Andrii Nakryiko
2024-09-03 22:20 ` Paul E. McKenney
@ 2024-10-02 19:59 ` Andrii Nakryiko
2024-10-03 16:09 ` Paul E. McKenney
1 sibling, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2024-10-02 19:59 UTC (permalink / raw)
To: paulmck
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Tue, Sep 3, 2024 at 3:08 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Hello!
> >
> > This series provides light-weight readers for SRCU. This lightness
> > is selected by the caller by using the new srcu_read_lock_lite() and
> > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > testing, this should still be considered to be experimental.
> >
> > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > on a given srcu_struct structure, then no other flavor may be used on
> > that srcu_struct structure, before, during, or after. (2) The _lite()
> > readers may only be invoked from regions of code where RCU is watching
> > (as in those regions in which rcu_is_watching() returns true). (3)
> > There is no auto-expediting for srcu_struct structures that have
> > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > having longer latencies than their non-_lite() counterparts. (5) Even
> > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > from NMI handlers (that is what the _nmisafe() interfaces are for).
> > Although one could imagine readers that were both _lite() and _nmisafe(),
> > one might also imagine that the read-modify-write atomic operations that
> > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > from a performance perspective.
> >
> > All that said, the patches in this series are as follows:
> >
> > 1. Rename srcu_might_be_idle() to srcu_should_expedite().
> >
> > 2. Introduce srcu_gp_is_expedited() helper function.
> >
> > 3. Renaming in preparation for additional reader flavor.
> >
> > 4. Bit manipulation changes for additional reader flavor.
> >
> > 5. Standardize srcu_data pointers to "sdp" and similar.
> >
> > 6. Convert srcu_data ->srcu_reader_flavor to bit field.
> >
> > 7. Add srcu_read_lock_lite() and srcu_read_unlock_lite().
> >
> > 8. rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
> >
> > 9. rcutorture: Add reader_flavor parameter for SRCU readers.
> >
> > 10. rcutorture: Add srcu_read_lock_lite() support to
> > rcutorture.reader_flavor.
> >
> > 11. refscale: Add srcu_read_lock_lite() support using "srcu-lite".
> >
> > Thanx, Paul
> >
>
> Thanks Paul for working on this!
>
> I applied your patches on top of all my uprobe changes (including the
> RFC patches that remove locks, optimize VMA to inode resolution, etc,
> etc; basically the fastest uprobe/uretprobe state I can get to). And
> then tested a few changes:
>
> - A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
> for uretprobes)
> - B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
> - C) B + RCU Tasks Trace converted to SRCU-lite
> - D) I also pessimized baseline by reverting RCU Tasks Trace, so
> both uprobes and uretprobes are SRCU protected. This allowed me to see
> a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
> performance out of the equation.
>
> In uprobes I used basically two benchmarks. One, uprobe-nop, that
> benchmarks entry uprobes (which are the fastest most optimized case,
> using RCU Tasks Trace in A and SRCU in D), and another that benchmarks
> return uprobes (uretprobes), called uretprobe-nop, which is normal
> SRCU both in A) and D). The latter uretprobe-nop benchmark basically
> combines entry and return probe overheads, because that's how
> uretprobes work.
>

Ok, so I created B' and C' cases, which are just like B and C from
before, but each now uses inlined versions of SRCU-lite API. I also
re-ran the latest BASELINE, which I'll call A', just to make sure all
the results are compatible and based off of the same tip/perf/core
branch state (uretprobe performance significantly improved for >64
CPUs, I don't know exactly why, tbh). I'll augment benchmark results
below inline for easier comparison.

> So, below are the most meaningful comparisons. First, SRCU vs
> SRCU-lite for uretprobes:
>
> BASELINE (A)
> ============
> uretprobe-nop ( 1 cpus): 1.941 ± 0.002M/s ( 1.941M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.731 ± 0.001M/s ( 1.866M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.492 ± 0.002M/s ( 1.831M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.234 ± 0.003M/s ( 1.808M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.448 ± 0.098M/s ( 1.681M/s/cpu)
> uretprobe-nop (16 cpus): 22.905 ± 0.009M/s ( 1.432M/s/cpu)
> uretprobe-nop (32 cpus): 44.760 ± 0.069M/s ( 1.399M/s/cpu)
> uretprobe-nop (40 cpus): 52.986 ± 0.104M/s ( 1.325M/s/cpu)
> uretprobe-nop (64 cpus): 43.650 ± 0.435M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 46.831 ± 0.938M/s ( 0.585M/s/cpu)
>
> SRCU-lite for uretprobe (B)
> ===========================
> uretprobe-nop ( 1 cpus): 2.014 ± 0.014M/s ( 2.014M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.820 ± 0.002M/s ( 1.910M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.640 ± 0.003M/s ( 1.880M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.410 ± 0.003M/s ( 1.852M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.877 ± 0.009M/s ( 1.735M/s/cpu)
> uretprobe-nop (16 cpus): 23.372 ± 0.022M/s ( 1.461M/s/cpu)
> uretprobe-nop (32 cpus): 45.748 ± 0.048M/s ( 1.430M/s/cpu)
> uretprobe-nop (40 cpus): 54.327 ± 0.093M/s ( 1.358M/s/cpu)
> uretprobe-nop (64 cpus): 43.672 ± 0.371M/s ( 0.682M/s/cpu)
> uretprobe-nop (80 cpus): 47.470 ± 0.753M/s ( 0.593M/s/cpu)
>

NEW BASELINE (A')
=================
uretprobe-nop ( 1 cpus): 1.946 ± 0.001M/s ( 1.946M/s/cpu)
uretprobe-nop ( 2 cpus): 3.660 ± 0.002M/s ( 1.830M/s/cpu)
uretprobe-nop ( 3 cpus): 5.522 ± 0.002M/s ( 1.841M/s/cpu)
uretprobe-nop ( 4 cpus): 7.145 ± 0.001M/s ( 1.786M/s/cpu)
uretprobe-nop ( 8 cpus): 13.449 ± 0.004M/s ( 1.681M/s/cpu)
uretprobe-nop (16 cpus): 22.374 ± 0.008M/s ( 1.398M/s/cpu)
uretprobe-nop (32 cpus): 45.039 ± 0.011M/s ( 1.407M/s/cpu)
uretprobe-nop (40 cpus): 42.422 ± 0.073M/s ( 1.061M/s/cpu)
uretprobe-nop (64 cpus): 65.136 ± 0.084M/s ( 1.018M/s/cpu)
uretprobe-nop (80 cpus): 76.004 ± 0.066M/s ( 0.950M/s/cpu)

SRCU-lite for uretprobe (B')
============================
uretprobe-nop ( 1 cpus): 1.973 ± 0.001M/s ( 1.973M/s/cpu)
uretprobe-nop ( 2 cpus): 3.756 ± 0.002M/s ( 1.878M/s/cpu)
uretprobe-nop ( 3 cpus): 5.623 ± 0.003M/s ( 1.874M/s/cpu)
uretprobe-nop ( 4 cpus): 7.206 ± 0.029M/s ( 1.802M/s/cpu)
uretprobe-nop ( 8 cpus): 13.668 ± 0.004M/s ( 1.708M/s/cpu)
uretprobe-nop (16 cpus): 23.067 ± 0.016M/s ( 1.442M/s/cpu)
uretprobe-nop (32 cpus): 45.757 ± 0.030M/s ( 1.430M/s/cpu)
uretprobe-nop (40 cpus): 54.550 ± 0.035M/s ( 1.364M/s/cpu)
uretprobe-nop (64 cpus): 67.124 ± 0.057M/s ( 1.049M/s/cpu)
uretprobe-nop (80 cpus): 77.150 ± 0.158M/s ( 0.964M/s/cpu)

Inlining does help a bit, adding +200-300K/s in some cases.
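To put figures like these in context, rows of the tables above can be
reduced to per-row deltas. The helpers below are a small sketch and not
part of any benchmark tooling; delta_kps() and pct_change() are names
made up here. For example, the 40-CPU uretprobe-nop rows for B
(54.327M/s) and B' (54.550M/s) differ by roughly 223K/s, and the 64-CPU
rows for A' (65.136M/s) and B' (67.124M/s) differ by roughly 3%. Because
B was measured earlier than B', apparently on a different branch state,
the B-versus-B' comparison is only indicative.

```c
/* Helpers for reading the benchmark tables; the names are illustrative only. */
#include <assert.h>

/* Absolute throughput change in K/s, given two rates in M/s. */
static double delta_kps(double before_mps, double after_mps)
{
	return (after_mps - before_mps) * 1000.0;
}

/* Relative throughput change in percent; the "before" rate must be positive. */
static double pct_change(double before_mps, double after_mps)
{
	assert(before_mps > 0.0);
	return (after_mps - before_mps) / before_mps * 100.0;
}
```

Reporting both forms is deliberate: an absolute gain of a few hundred
K/s is a larger relative improvement at low CPU counts than at high
ones, so either number alone can mislead.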
> You can see that across the board (except for noisy 64 CPU case)
> SRCU-lite is faster.
>
>
> Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
> vs SRCU-lite for uprobes.
>
> BASELINE (A)
> ============
> uprobe-nop ( 1 cpus): 3.574 ± 0.004M/s ( 3.574M/s/cpu)
> uprobe-nop ( 2 cpus): 6.735 ± 0.006M/s ( 3.368M/s/cpu)
> uprobe-nop ( 3 cpus): 10.102 ± 0.005M/s ( 3.367M/s/cpu)
> uprobe-nop ( 4 cpus): 13.087 ± 0.008M/s ( 3.272M/s/cpu)
> uprobe-nop ( 8 cpus): 24.622 ± 0.031M/s ( 3.078M/s/cpu)
> uprobe-nop (16 cpus): 41.752 ± 0.020M/s ( 2.610M/s/cpu)
> uprobe-nop (32 cpus): 84.973 ± 0.115M/s ( 2.655M/s/cpu)
> uprobe-nop (40 cpus): 102.229 ± 0.030M/s ( 2.556M/s/cpu)
> uprobe-nop (64 cpus): 125.537 ± 0.045M/s ( 1.962M/s/cpu)
> uprobe-nop (80 cpus): 143.091 ± 0.044M/s ( 1.789M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>

NEW BASELINE (A')
=================
uprobe-nop ( 1 cpus): 3.480 ± 0.036M/s ( 3.480M/s/cpu)
uprobe-nop ( 2 cpus): 6.652 ± 0.026M/s ( 3.326M/s/cpu)
uprobe-nop ( 3 cpus): 10.050 ± 0.011M/s ( 3.350M/s/cpu)
uprobe-nop ( 4 cpus): 13.079 ± 0.008M/s ( 3.270M/s/cpu)
uprobe-nop ( 8 cpus): 24.620 ± 0.004M/s ( 3.077M/s/cpu)
uprobe-nop (16 cpus): 41.566 ± 0.030M/s ( 2.598M/s/cpu)
uprobe-nop (32 cpus): 77.314 ± 1.620M/s ( 2.416M/s/cpu)
uprobe-nop (40 cpus): 102.667 ± 0.047M/s ( 2.567M/s/cpu)
uprobe-nop (64 cpus): 126.298 ± 0.026M/s ( 1.973M/s/cpu)
uprobe-nop (80 cpus): 146.682 ± 0.035M/s ( 1.834M/s/cpu)

SRCU-lite for uprobes w/ inlining (C')
======================================
uprobe-nop ( 1 cpus): 3.444 ± 0.014M/s ( 3.444M/s/cpu)
uprobe-nop ( 2 cpus): 6.400 ± 0.021M/s ( 3.200M/s/cpu)
uprobe-nop ( 3 cpus): 9.568 ± 0.025M/s ( 3.189M/s/cpu)
uprobe-nop ( 4 cpus): 12.473 ± 0.020M/s ( 3.118M/s/cpu)
uprobe-nop ( 8 cpus): 23.552 ± 0.007M/s ( 2.944M/s/cpu)
uprobe-nop (16 cpus): 39.844 ± 0.016M/s ( 2.490M/s/cpu)
uprobe-nop (32 cpus): 78.667 ± 0.201M/s ( 2.458M/s/cpu)
uprobe-nop (40 cpus): 97.477 ± 0.094M/s ( 2.437M/s/cpu)
uprobe-nop (64 cpus): 119.472 ± 0.120M/s ( 1.867M/s/cpu)
uprobe-nop (80 cpus): 139.825 ± 0.042M/s ( 1.748M/s/cpu)

>
> Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
> so consider this just a confirmation. I'm not sure I'd like to switch
> from RCU Tasks Trace to SRCU-lite for uprobes part, but at least we
> have numbers to make that decision.
>
> Finally, to see SRCU vs SRCU-lite for entry uprobes improvements
> (i.e., if we never had RCU Tasks Trace). I've included a bit more
> extensive set of CPU counts for completeness.
>
> BASELINE w/ SRCU for uprobes (D)
> ================================
> uprobe-nop ( 1 cpus): 3.413 ± 0.003M/s ( 3.413M/s/cpu)
> uprobe-nop ( 2 cpus): 6.305 ± 0.003M/s ( 3.153M/s/cpu)
> uprobe-nop ( 3 cpus): 9.442 ± 0.018M/s ( 3.147M/s/cpu)
> uprobe-nop ( 4 cpus): 12.253 ± 0.006M/s ( 3.063M/s/cpu)
> uprobe-nop ( 5 cpus): 15.316 ± 0.007M/s ( 3.063M/s/cpu)
> uprobe-nop ( 6 cpus): 18.287 ± 0.030M/s ( 3.048M/s/cpu)
> uprobe-nop ( 7 cpus): 21.378 ± 0.025M/s ( 3.054M/s/cpu)
> uprobe-nop ( 8 cpus): 23.044 ± 0.010M/s ( 2.881M/s/cpu)
> uprobe-nop (10 cpus): 28.778 ± 0.012M/s ( 2.878M/s/cpu)
> uprobe-nop (12 cpus): 31.300 ± 0.016M/s ( 2.608M/s/cpu)
> uprobe-nop (14 cpus): 36.580 ± 0.007M/s ( 2.613M/s/cpu)
> uprobe-nop (16 cpus): 38.848 ± 0.017M/s ( 2.428M/s/cpu)
> uprobe-nop (24 cpus): 60.298 ± 0.080M/s ( 2.512M/s/cpu)
> uprobe-nop (32 cpus): 77.137 ± 1.957M/s ( 2.411M/s/cpu)
> uprobe-nop (40 cpus): 89.205 ± 1.278M/s ( 2.230M/s/cpu)
> uprobe-nop (48 cpus): 99.207 ± 0.444M/s ( 2.067M/s/cpu)
> uprobe-nop (56 cpus): 102.399 ± 0.484M/s ( 1.829M/s/cpu)
> uprobe-nop (64 cpus): 115.390 ± 0.972M/s ( 1.803M/s/cpu)
> uprobe-nop (72 cpus): 127.476 ± 0.050M/s ( 1.770M/s/cpu)
> uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
>
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> uprobe-nop ( 5 cpus): 15.634 ± 0.008M/s ( 3.127M/s/cpu)
> uprobe-nop ( 6 cpus): 18.443 ± 0.018M/s ( 3.074M/s/cpu)
> uprobe-nop ( 7 cpus): 21.793 ± 0.057M/s ( 3.113M/s/cpu)
> uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> uprobe-nop (10 cpus): 29.430 ± 0.021M/s ( 2.943M/s/cpu)
> uprobe-nop (12 cpus): 32.035 ± 0.008M/s ( 2.670M/s/cpu)
> uprobe-nop (14 cpus): 37.174 ± 0.046M/s ( 2.655M/s/cpu)
> uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> uprobe-nop (24 cpus): 61.656 ± 0.187M/s ( 2.569M/s/cpu)
> uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> uprobe-nop (48 cpus): 104.178 ± 0.033M/s ( 2.170M/s/cpu)
> uprobe-nop (56 cpus): 105.689 ± 0.703M/s ( 1.887M/s/cpu)
> uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> uprobe-nop (72 cpus): 127.574 ± 0.033M/s ( 1.772M/s/cpu)
> uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
>
> So, say, at 32 threads, we get 79.6 vs 77.1, which is about 3%
> throughput win. Which is not negligible!
>
> Note that as we get to 80 cores data is more noisy (hyperthreading,
> background system noise, etc). But you can still see an improvement
> across basically the entire range.
>
> Hopefully the above data is useful.
>
> > ------------------------------------------------------------------------
> >
> > Documentation/admin-guide/kernel-parameters.txt | 4
> > b/Documentation/admin-guide/kernel-parameters.txt | 8 +
> > b/include/linux/srcu.h | 21 +-
> > b/include/linux/srcutree.h | 2
> > b/kernel/rcu/rcutorture.c | 28 +--
> > b/kernel/rcu/refscale.c | 54 +++++--
> > b/kernel/rcu/srcutree.c | 16 +-
> > include/linux/srcu.h | 86 +++++++++--
> > include/linux/srcutree.h | 5
> > kernel/rcu/rcutorture.c | 37 +++-
> > kernel/rcu/srcutree.c | 168 +++++++++++++++-------
> > 11 files changed, 308 insertions(+), 121 deletions(-)
* Re: [PATCH rcu 0/11] Add light-weight readers for SRCU
2024-10-02 19:59 ` Andrii Nakryiko
@ 2024-10-03 16:09 ` Paul E. McKenney
0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2024-10-03 16:09 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: rcu, linux-kernel, kernel-team, rostedt, Alexei Starovoitov,
Andrii Nakryiko, Peter Zijlstra, Kent Overstreet, bpf
On Wed, Oct 02, 2024 at 12:59:55PM -0700, Andrii Nakryiko wrote:
> On Tue, Sep 3, 2024 at 3:08 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > Hello!
> > >
> > > This series provides light-weight readers for SRCU. This lightness
> > > is selected by the caller by using the new srcu_read_lock_lite() and
> > > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > > srcu_read_unlock() flavors. Although this passes significant rcutorture
> > > testing, this should still be considered to be experimental.
> > >
> > > There are a few restrictions: (1) If srcu_read_lock_lite() is called
> > > on a given srcu_struct structure, then no other flavor may be used on
> > > that srcu_struct structure, before, during, or after. (2) The _lite()
> > > readers may only be invoked from regions of code where RCU is watching
> > > (as in those regions in which rcu_is_watching() returns true). (3)
> > > There is no auto-expediting for srcu_struct structures that have
> > > been passed to _lite() readers. (4) SRCU grace periods for _lite()
> > > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > > having longer latencies than their non-_lite() counterparts. (5) Even
> > > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > > the IPI-happy synchronize_rcu_expedited() function. (6) Just as with
> > > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > > from NMI handlers (that is what the _nmisafe() interfaces are for).
> > > Although one could imagine readers that were both _lite() and _nmisafe(),
> > > one might also imagine that the read-modify-write atomic operations that
> > > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > > from a performance perspective.
> > >
> > > All that said, the patches in this series are as follows:
> > >
> > > 1. Rename srcu_might_be_idle() to srcu_should_expedite().
> > >
> > > 2. Introduce srcu_gp_is_expedited() helper function.
> > >
> > > 3. Renaming in preparation for additional reader flavor.
> > >
> > > 4. Bit manipulation changes for additional reader flavor.
> > >
> > > 5. Standardize srcu_data pointers to "sdp" and similar.
> > >
> > > 6. Convert srcu_data ->srcu_reader_flavor to bit field.
> > >
> > > 7. Add srcu_read_lock_lite() and srcu_read_unlock_lite().
> > >
> > > 8. rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
> > >
> > > 9. rcutorture: Add reader_flavor parameter for SRCU readers.
> > >
> > > 10. rcutorture: Add srcu_read_lock_lite() support to
> > > rcutorture.reader_flavor.
> > >
> > > 11. refscale: Add srcu_read_lock_lite() support using "srcu-lite".
> > >
> > > Thanx, Paul
> > >
> >
> > Thanks Paul for working on this!
> >
> > I applied your patches on top of all my uprobe changes (including the
> > RFC patches that remove locks, optimize VMA to inode resolution, etc,
> > etc; basically the fastest uprobe/uretprobe state I can get to). And
> > then tested a few changes:
> >
> > - A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
> > for uretprobes)
> > - B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
> > - C) B + RCU Tasks Trace converted to SRCU-lite
> > - D) I also pessimized baseline by reverting RCU Tasks Trace, so
> > both uprobes and uretprobes are SRCU protected. This allowed me to see
> > a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
> > performance out of the equation.
> >
> > In uprobes I used basically two benchmarks. One, uprobe-nop, that
> > benchmarks entry uprobes (which are the fastest most optimized case,
> > using RCU Tasks Trace in A and SRCU in D), and another that benchmarks
> > return uprobes (uretprobes), called uretprobe-nop, which is normal
> > SRCU both in A) and D). The latter uretprobe-nop benchmark basically
> > combines entry and return probe overheads, because that's how
> > uretprobes work.
> >
>
> Ok, so I created B' and C' cases, which are just like B and C from
> before, but each now uses inlined versions of SRCU-lite API. I also
> re-ran the latest BASELINE, which I'll call A', just to make sure all
> the results are compatible and based off of the same tip/perf/core
> branch state (uretprobe performance significantly improved for >64
> CPUs, I don't know exactly why, tbh). I'll augment benchmark results
> below inline for easier comparison.
>
> > So, below are the most meaningful comparisons. First, SRCU vs
> > SRCU-lite for uretprobes:
> >
> > BASELINE (A)
> > ============
> > uretprobe-nop ( 1 cpus): 1.941 ± 0.002M/s ( 1.941M/s/cpu)
> > uretprobe-nop ( 2 cpus): 3.731 ± 0.001M/s ( 1.866M/s/cpu)
> > uretprobe-nop ( 3 cpus): 5.492 ± 0.002M/s ( 1.831M/s/cpu)
> > uretprobe-nop ( 4 cpus): 7.234 ± 0.003M/s ( 1.808M/s/cpu)
> > uretprobe-nop ( 8 cpus): 13.448 ± 0.098M/s ( 1.681M/s/cpu)
> > uretprobe-nop (16 cpus): 22.905 ± 0.009M/s ( 1.432M/s/cpu)
> > uretprobe-nop (32 cpus): 44.760 ± 0.069M/s ( 1.399M/s/cpu)
> > uretprobe-nop (40 cpus): 52.986 ± 0.104M/s ( 1.325M/s/cpu)
> > uretprobe-nop (64 cpus): 43.650 ± 0.435M/s ( 0.682M/s/cpu)
> > uretprobe-nop (80 cpus): 46.831 ± 0.938M/s ( 0.585M/s/cpu)
> >
> > SRCU-lite for uretprobe (B)
> > ===========================
> > uretprobe-nop ( 1 cpus): 2.014 ± 0.014M/s ( 2.014M/s/cpu)
> > uretprobe-nop ( 2 cpus): 3.820 ± 0.002M/s ( 1.910M/s/cpu)
> > uretprobe-nop ( 3 cpus): 5.640 ± 0.003M/s ( 1.880M/s/cpu)
> > uretprobe-nop ( 4 cpus): 7.410 ± 0.003M/s ( 1.852M/s/cpu)
> > uretprobe-nop ( 8 cpus): 13.877 ± 0.009M/s ( 1.735M/s/cpu)
> > uretprobe-nop (16 cpus): 23.372 ± 0.022M/s ( 1.461M/s/cpu)
> > uretprobe-nop (32 cpus): 45.748 ± 0.048M/s ( 1.430M/s/cpu)
> > uretprobe-nop (40 cpus): 54.327 ± 0.093M/s ( 1.358M/s/cpu)
> > uretprobe-nop (64 cpus): 43.672 ± 0.371M/s ( 0.682M/s/cpu)
> > uretprobe-nop (80 cpus): 47.470 ± 0.753M/s ( 0.593M/s/cpu)
> >
>
> NEW BASELINE (A')
> =================
> uretprobe-nop ( 1 cpus): 1.946 ± 0.001M/s ( 1.946M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.660 ± 0.002M/s ( 1.830M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.522 ± 0.002M/s ( 1.841M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.145 ± 0.001M/s ( 1.786M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.449 ± 0.004M/s ( 1.681M/s/cpu)
> uretprobe-nop (16 cpus): 22.374 ± 0.008M/s ( 1.398M/s/cpu)
> uretprobe-nop (32 cpus): 45.039 ± 0.011M/s ( 1.407M/s/cpu)
> uretprobe-nop (40 cpus): 42.422 ± 0.073M/s ( 1.061M/s/cpu)
> uretprobe-nop (64 cpus): 65.136 ± 0.084M/s ( 1.018M/s/cpu)
> uretprobe-nop (80 cpus): 76.004 ± 0.066M/s ( 0.950M/s/cpu)
>
> SRCU-lite for uretprobe (B')
> ============================
> uretprobe-nop ( 1 cpus): 1.973 ± 0.001M/s ( 1.973M/s/cpu)
> uretprobe-nop ( 2 cpus): 3.756 ± 0.002M/s ( 1.878M/s/cpu)
> uretprobe-nop ( 3 cpus): 5.623 ± 0.003M/s ( 1.874M/s/cpu)
> uretprobe-nop ( 4 cpus): 7.206 ± 0.029M/s ( 1.802M/s/cpu)
> uretprobe-nop ( 8 cpus): 13.668 ± 0.004M/s ( 1.708M/s/cpu)
> uretprobe-nop (16 cpus): 23.067 ± 0.016M/s ( 1.442M/s/cpu)
> uretprobe-nop (32 cpus): 45.757 ± 0.030M/s ( 1.430M/s/cpu)
> uretprobe-nop (40 cpus): 54.550 ± 0.035M/s ( 1.364M/s/cpu)
> uretprobe-nop (64 cpus): 67.124 ± 0.057M/s ( 1.049M/s/cpu)
> uretprobe-nop (80 cpus): 77.150 ± 0.158M/s ( 0.964M/s/cpu)
>
> Inlining does help a bit, adding +200-300K/s in some cases.

Thank you for testing this! It seems compelling enough for me to send
this into the next merge window along with the base support, then.

Thanx, Paul

> > You can see that across the board (except for noisy 64 CPU case)
> > SRCU-lite is faster.
> >
> >
> > Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
> > vs SRCU-lite for uprobes.
> >
> > BASELINE (A)
> > ============
> > uprobe-nop ( 1 cpus): 3.574 ± 0.004M/s ( 3.574M/s/cpu)
> > uprobe-nop ( 2 cpus): 6.735 ± 0.006M/s ( 3.368M/s/cpu)
> > uprobe-nop ( 3 cpus): 10.102 ± 0.005M/s ( 3.367M/s/cpu)
> > uprobe-nop ( 4 cpus): 13.087 ± 0.008M/s ( 3.272M/s/cpu)
> > uprobe-nop ( 8 cpus): 24.622 ± 0.031M/s ( 3.078M/s/cpu)
> > uprobe-nop (16 cpus): 41.752 ± 0.020M/s ( 2.610M/s/cpu)
> > uprobe-nop (32 cpus): 84.973 ± 0.115M/s ( 2.655M/s/cpu)
> > uprobe-nop (40 cpus): 102.229 ± 0.030M/s ( 2.556M/s/cpu)
> > uprobe-nop (64 cpus): 125.537 ± 0.045M/s ( 1.962M/s/cpu)
> > uprobe-nop (80 cpus): 143.091 ± 0.044M/s ( 1.789M/s/cpu)
> >
> > SRCU-lite for uprobes (C)
> > =========================
> > uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> > uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> > uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> > uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> > uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> > uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> > uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> > uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> > uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> > uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
> >
>
> NEW BASELINE (A')
> =================
> uprobe-nop ( 1 cpus): 3.480 ± 0.036M/s ( 3.480M/s/cpu)
> uprobe-nop ( 2 cpus): 6.652 ± 0.026M/s ( 3.326M/s/cpu)
> uprobe-nop ( 3 cpus): 10.050 ± 0.011M/s ( 3.350M/s/cpu)
> uprobe-nop ( 4 cpus): 13.079 ± 0.008M/s ( 3.270M/s/cpu)
> uprobe-nop ( 8 cpus): 24.620 ± 0.004M/s ( 3.077M/s/cpu)
> uprobe-nop (16 cpus): 41.566 ± 0.030M/s ( 2.598M/s/cpu)
> uprobe-nop (32 cpus): 77.314 ± 1.620M/s ( 2.416M/s/cpu)
> uprobe-nop (40 cpus): 102.667 ± 0.047M/s ( 2.567M/s/cpu)
> uprobe-nop (64 cpus): 126.298 ± 0.026M/s ( 1.973M/s/cpu)
> uprobe-nop (80 cpus): 146.682 ± 0.035M/s ( 1.834M/s/cpu)
>
> SRCU-lite for uprobes w/ inlining (C')
> ======================================
> uprobe-nop ( 1 cpus): 3.444 ± 0.014M/s ( 3.444M/s/cpu)
> uprobe-nop ( 2 cpus): 6.400 ± 0.021M/s ( 3.200M/s/cpu)
> uprobe-nop ( 3 cpus): 9.568 ± 0.025M/s ( 3.189M/s/cpu)
> uprobe-nop ( 4 cpus): 12.473 ± 0.020M/s ( 3.118M/s/cpu)
> uprobe-nop ( 8 cpus): 23.552 ± 0.007M/s ( 2.944M/s/cpu)
> uprobe-nop (16 cpus): 39.844 ± 0.016M/s ( 2.490M/s/cpu)
> uprobe-nop (32 cpus): 78.667 ± 0.201M/s ( 2.458M/s/cpu)
> uprobe-nop (40 cpus): 97.477 ± 0.094M/s ( 2.437M/s/cpu)
> uprobe-nop (64 cpus): 119.472 ± 0.120M/s ( 1.867M/s/cpu)
> uprobe-nop (80 cpus): 139.825 ± 0.042M/s ( 1.748M/s/cpu)
>
> >
> > Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
> > so consider this just a confirmation. I'm not sure I'd like to switch
> > from RCU Tasks Trace to SRCU-lite for the uprobes part, but at least
> > we have numbers to make that decision.
> >
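For a rough sense of how large the RCU Tasks Trace advantage is, the gap
can be computed from the BASELINE (A) and SRCU-lite (C) uprobe-nop tables
above. The Python sketch below is only an illustration: the dictionary and
function names are made up for this example, and the throughput values
(in M/s) are copied from those two tables.

```python
# Throughput in M/s, copied from the BASELINE (A) and SRCU-lite (C) tables.
# Keys are CPU counts. All names here are illustrative, not from the thread.
BASELINE_A = {1: 3.574, 2: 6.735, 3: 10.102, 4: 13.087, 8: 24.622,
              16: 41.752, 32: 84.973, 40: 102.229, 64: 125.537, 80: 143.091}
SRCU_LITE_C = {1: 3.446, 2: 6.411, 3: 9.563, 4: 12.454, 8: 23.172,
               16: 39.793, 32: 79.616, 40: 96.851, 64: 119.432, 80: 135.162}


def relative_gap_percent(reference, candidate):
    """Percent by which candidate trails reference at each CPU count."""
    return {cpus: 100.0 * (reference[cpus] - candidate[cpus]) / reference[cpus]
            for cpus in sorted(reference)}


gap = relative_gap_percent(BASELINE_A, SRCU_LITE_C)
for cpus, pct in gap.items():
    print(f"{cpus:2d} cpus: SRCU-lite trails RCU Tasks Trace by {pct:.1f}%")
```

With these numbers, SRCU-lite trails by roughly 3.6% to 6.3% depending on
the CPU count (about 6.3% at 32 CPUs).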
> > Finally, let's look at SRCU vs SRCU-lite for entry uprobes (i.e., as
> > if we never had RCU Tasks Trace). I've included a more extensive set
> > of CPU counts for completeness.
> >
> > BASELINE w/ SRCU for uprobes (D)
> > ================================
> > uprobe-nop ( 1 cpus): 3.413 ± 0.003M/s ( 3.413M/s/cpu)
> > uprobe-nop ( 2 cpus): 6.305 ± 0.003M/s ( 3.153M/s/cpu)
> > uprobe-nop ( 3 cpus): 9.442 ± 0.018M/s ( 3.147M/s/cpu)
> > uprobe-nop ( 4 cpus): 12.253 ± 0.006M/s ( 3.063M/s/cpu)
> > uprobe-nop ( 5 cpus): 15.316 ± 0.007M/s ( 3.063M/s/cpu)
> > uprobe-nop ( 6 cpus): 18.287 ± 0.030M/s ( 3.048M/s/cpu)
> > uprobe-nop ( 7 cpus): 21.378 ± 0.025M/s ( 3.054M/s/cpu)
> > uprobe-nop ( 8 cpus): 23.044 ± 0.010M/s ( 2.881M/s/cpu)
> > uprobe-nop (10 cpus): 28.778 ± 0.012M/s ( 2.878M/s/cpu)
> > uprobe-nop (12 cpus): 31.300 ± 0.016M/s ( 2.608M/s/cpu)
> > uprobe-nop (14 cpus): 36.580 ± 0.007M/s ( 2.613M/s/cpu)
> > uprobe-nop (16 cpus): 38.848 ± 0.017M/s ( 2.428M/s/cpu)
> > uprobe-nop (24 cpus): 60.298 ± 0.080M/s ( 2.512M/s/cpu)
> > uprobe-nop (32 cpus): 77.137 ± 1.957M/s ( 2.411M/s/cpu)
> > uprobe-nop (40 cpus): 89.205 ± 1.278M/s ( 2.230M/s/cpu)
> > uprobe-nop (48 cpus): 99.207 ± 0.444M/s ( 2.067M/s/cpu)
> > uprobe-nop (56 cpus): 102.399 ± 0.484M/s ( 1.829M/s/cpu)
> > uprobe-nop (64 cpus): 115.390 ± 0.972M/s ( 1.803M/s/cpu)
> > uprobe-nop (72 cpus): 127.476 ± 0.050M/s ( 1.770M/s/cpu)
> > uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
> >
> > SRCU-lite for uprobes (C)
> > =========================
> > uprobe-nop ( 1 cpus): 3.446 ± 0.010M/s ( 3.446M/s/cpu)
> > uprobe-nop ( 2 cpus): 6.411 ± 0.003M/s ( 3.206M/s/cpu)
> > uprobe-nop ( 3 cpus): 9.563 ± 0.039M/s ( 3.188M/s/cpu)
> > uprobe-nop ( 4 cpus): 12.454 ± 0.016M/s ( 3.113M/s/cpu)
> > uprobe-nop ( 5 cpus): 15.634 ± 0.008M/s ( 3.127M/s/cpu)
> > uprobe-nop ( 6 cpus): 18.443 ± 0.018M/s ( 3.074M/s/cpu)
> > uprobe-nop ( 7 cpus): 21.793 ± 0.057M/s ( 3.113M/s/cpu)
> > uprobe-nop ( 8 cpus): 23.172 ± 0.013M/s ( 2.897M/s/cpu)
> > uprobe-nop (10 cpus): 29.430 ± 0.021M/s ( 2.943M/s/cpu)
> > uprobe-nop (12 cpus): 32.035 ± 0.008M/s ( 2.670M/s/cpu)
> > uprobe-nop (14 cpus): 37.174 ± 0.046M/s ( 2.655M/s/cpu)
> > uprobe-nop (16 cpus): 39.793 ± 0.005M/s ( 2.487M/s/cpu)
> > uprobe-nop (24 cpus): 61.656 ± 0.187M/s ( 2.569M/s/cpu)
> > uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> > uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> > uprobe-nop (48 cpus): 104.178 ± 0.033M/s ( 2.170M/s/cpu)
> > uprobe-nop (56 cpus): 105.689 ± 0.703M/s ( 1.887M/s/cpu)
> > uprobe-nop (64 cpus): 119.432 ± 0.146M/s ( 1.866M/s/cpu)
> > uprobe-nop (72 cpus): 127.574 ± 0.033M/s ( 1.772M/s/cpu)
> > uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
> >
> > So, say, at 32 threads, we get 79.6 vs 77.1, which is about a 3%
> > throughput win. That is not negligible!
> >
> > Note that as we get to 80 cores, the data gets noisier (hyperthreading,
> > background system noise, etc). But you can still see an improvement
> > across basically the entire range.
> >
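The 3% figure can be reproduced directly from the benchmark output. The
sketch below parses result lines in the format shown in the tables above
and reports the SRCU-lite (C) gain over SRCU (D). It is an illustration
under stated assumptions: the regular expression and the helper names are
invented here, and only three rows from each table are included to keep
it short.

```python
import re

# Matches e.g. "(32 cpus): 77.137 ± 1.957M/s" in a result line.
# This pattern is an assumption based on the table format in this thread.
LINE_RE = re.compile(r"\(\s*(\d+) cpus\):\s*([\d.]+)\s*±\s*([\d.]+)M/s")


def parse_results(text):
    """Map CPU count -> (throughput, error), both in M/s."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            results[int(match.group(1))] = (float(match.group(2)),
                                            float(match.group(3)))
    return results


def gain_percent(base, new):
    """Percent throughput change of new over base, per shared CPU count."""
    return {cpus: 100.0 * (new[cpus][0] - base[cpus][0]) / base[cpus][0]
            for cpus in sorted(base.keys() & new.keys())}


# Rows copied from tables (D) and (C), quoting prefixes included.
SRCU_D = """\
> > uprobe-nop (32 cpus): 77.137 ± 1.957M/s ( 2.411M/s/cpu)
> > uprobe-nop (40 cpus): 89.205 ± 1.278M/s ( 2.230M/s/cpu)
> > uprobe-nop (80 cpus): 137.304 ± 0.068M/s ( 1.716M/s/cpu)
"""
SRCU_LITE_C = """\
> > uprobe-nop (32 cpus): 79.616 ± 0.207M/s ( 2.488M/s/cpu)
> > uprobe-nop (40 cpus): 96.851 ± 0.128M/s ( 2.421M/s/cpu)
> > uprobe-nop (80 cpus): 135.162 ± 0.207M/s ( 1.690M/s/cpu)
"""

gains = gain_percent(parse_results(SRCU_D), parse_results(SRCU_LITE_C))
for cpus, pct in gains.items():
    print(f"{cpus:2d} cpus: {pct:+.1f}%")
```

For these rows the gain is about +3.2% at 32 CPUs and larger at 40 CPUs,
while at 80 CPUs the sign flips, which is consistent with the noise
caveat for the highest CPU counts.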
> > Hopefully the above data is useful.
> >
> > > ------------------------------------------------------------------------
> > >
> > >  Documentation/admin-guide/kernel-parameters.txt   |   4
> > >  b/Documentation/admin-guide/kernel-parameters.txt |   8 +
> > >  b/include/linux/srcu.h                            |  21 +-
> > >  b/include/linux/srcutree.h                        |   2
> > >  b/kernel/rcu/rcutorture.c                         |  28 +--
> > >  b/kernel/rcu/refscale.c                           |  54 +++++--
> > >  b/kernel/rcu/srcutree.c                           |  16 +-
> > >  include/linux/srcu.h                              |  86 +++++++++--
> > >  include/linux/srcutree.h                          |   5
> > >  kernel/rcu/rcutorture.c                           |  37 +++-
> > >  kernel/rcu/srcutree.c                             | 168 +++++++++++++++-------
> > >  11 files changed, 308 insertions(+), 121 deletions(-)
Thread overview: 23+ messages (newest: 2024-10-03 16:09 UTC)
2024-09-03 16:32 [PATCH rcu 0/11] Add light-weight readers for SRCU Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 01/11] srcu: Rename srcu_might_be_idle() to srcu_should_expedite() Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 02/11] srcu: Introduce srcu_gp_is_expedited() helper function Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 03/11] srcu: Renaming in preparation for additional reader flavor Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 04/11] srcu: Bit manipulation changes " Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 05/11] srcu: Standardize srcu_data pointers to "sdp" and similar Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 06/11] srcu: Convert srcu_data ->srcu_reader_flavor to bit field Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 07/11] srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite() Paul E. McKenney
2024-09-03 19:45 ` Alexei Starovoitov
2024-09-03 22:10 ` Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 08/11] rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 09/11] rcutorture: Add reader_flavor parameter for SRCU readers Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 10/11] rcutorture: Add srcu_read_lock_lite() support to rcutorture.reader_flavor Paul E. McKenney
2024-09-03 16:33 ` [PATCH rcu 11/11] refscale: Add srcu_read_lock_lite() support using "srcu-lite" Paul E. McKenney
2024-09-03 21:38 ` [PATCH rcu 0/11] Add light-weight readers for SRCU Kent Overstreet
2024-09-03 22:13 ` Paul E. McKenney
2024-09-03 23:03 ` Kent Overstreet
2024-09-04 10:54 ` Paul E. McKenney
2024-09-03 22:08 ` Andrii Nakryiko
2024-09-03 22:20 ` Paul E. McKenney
2024-10-02 19:59 ` Andrii Nakryiko
2024-10-03 16:09 ` Paul E. McKenney
2024-09-04 16:52 ` [PATCH rcu [12/11] srcu: Allow inlining of __srcu_read_{,un}lock_lite() Paul E. McKenney