* [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further
@ 2022-06-27 12:37 Neeraj Upadhyay
2022-06-27 22:45 ` Paul E. McKenney
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Neeraj Upadhyay @ 2022-06-27 12:37 UTC (permalink / raw)
To: paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai,
joel
Cc: linux-kernel, zhangfei.gao, boqun.feng, urezki,
shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger,
chenxiang66, maz, Neeraj Upadhyay
Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited
grace periods") highlights a problem where aggressively blocking
SRCU expedited grace periods, as was introduced in commit
282d8998e997 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU"), introduces ~2 minutes delay to the overall
~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd"
cmdline on qemu, which results in very high rate of memslots
add/remove, which causes > ~6000 synchronize_srcu() calls for
kvm->srcu SRCU instance.
Below table captures the experiments done by Zhangfei Gao, Shameer,
to measure the boottime impact with various values of non-sleeping
per phase counts, with HZ_250 and preemption enabled:
+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE | Boot time (s) |
+──────────────────────────+────────────────+
| 100 | 30.053 |
| 150 | 25.151 |
| 200 | 20.704 |
| 250 | 15.748 |
| 500 | 11.401 |
| 1000 | 11.443 |
| 10000 | 11.258 |
| 1000000 | 11.154 |
+──────────────────────────+────────────────+
Analysis on the experiment results showed improved boot time
with non blocking delays close to one jiffy duration. This
was also seen when number of per-phase iterations were scaled
to one jiffy.
So, this change scales per-grace-period phase number of non-sleeping
polls, soiuch that, non-sleeping polls are done for one jiffy. In addition
to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate
the delay used for scheduling callbacks, is replaced with the check for
expedited grace period. This is done, to schedule cbs for completed expedited
grace periods immediately, which results in improved boot time seen in
experiments.
In addition to the changes to default per phase delays, this change
adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays,
depending on their system configuration requirements.
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
---
.../admin-guide/kernel-parameters.txt | 18 +++++
kernel/rcu/srcutree.c | 79 ++++++++++++++-----
2 files changed, 78 insertions(+), 19 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index af647714c113..7e34086c64f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5805,6 +5805,24 @@
expediting. Set to zero to disable automatic
expediting.
+ srcutree.srcu_max_nodelay [KNL]
+ Specifies the number of no-delay instances
+ per jiffy for which the SRCU grace period
+ worker thread will be rescheduled with zero
+ delay. Beyond this limit, worker thread will
+ be rescheduled with a sleep delay of one jiffy.
+
+ srcutree.srcu_max_nodelay_phase [KNL]
+ Specifies the per-grace-period phase, number of
+ non-sleeping polls of readers. Beyond this limit,
+ grace period worker thread will be rescheduled
+ with a sleep delay of one jiffy, between each
+ rescan of the readers, for a grace period phase.
+
+ srcutree.srcu_retry_check_delay [KNL]
+ Specifies number of microseconds of non-sleeping
+ delay between each non-sleeping poll of readers.
+
srcutree.small_contention_lim [KNL]
Specifies the number of update-side contention
events per jiffy will be tolerated before
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 0db7873f4e95..006828b9c41a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -511,10 +511,49 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
return sum;
}
-#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
-#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
-#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
+/*
+ * We use an adaptive strategy for synchronize_srcu() and especially for
+ * synchronize_srcu_expedited(). We spin for a fixed time period
+ * (defined below, boot time configurable) to allow SRCU readers to exit
+ * their read-side critical sections. If there are still some readers
+ * after one jiffy, we repeatedly block for one jiffy time periods.
+ * The blocking time is increased as the grace-period age increases,
+ * with max blocking time capped at 10 jiffies.
+ */
+#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5
+
+static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY;
+module_param(srcu_retry_check_delay, ulong, 0444);
+
+#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
+#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
+
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on default per-GP-phase
+ // no-delay instances.
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on default per-GP-phase
+ // no-delay instances.
+
+#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low))
+#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high))
+// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto
+// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay()
+// called from process_srcu().
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \
+ (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY)
+
+// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \
+ SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, \
+ SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \
+ SRCU_DEFAULT_MAX_NODELAY_PHASE_HI))
+
+static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE;
+module_param(srcu_max_nodelay_phase, ulong, 0444);
+
+#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
+
+static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY;
+module_param(srcu_max_nodelay, ulong, 0444);
/*
* Return grace-period delay, zero if there are expedited grace
@@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
jbase += j - gpstart;
if (!jbase) {
WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
- if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+ if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase)
jbase = 1;
}
}
@@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
}
EXPORT_SYMBOL_GPL(__srcu_read_unlock);
-/*
- * We use an adaptive strategy for synchronize_srcu() and especially for
- * synchronize_srcu_expedited(). We spin for a fixed time period
- * (defined below) to allow SRCU readers to exit their read-side critical
- * sections. If there are still some readers after a few microseconds,
- * we repeatedly block for 1-millisecond time periods.
- */
-#define SRCU_RETRY_CHECK_DELAY 5
-
/*
* Start an SRCU grace period.
*/
@@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
*/
static void srcu_gp_end(struct srcu_struct *ssp)
{
- unsigned long cbdelay;
+ unsigned long cbdelay = 1;
bool cbs;
bool last_lvl;
int cpu;
@@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
spin_lock_irq_rcu_node(ssp);
idx = rcu_seq_state(ssp->srcu_gp_seq);
WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
- cbdelay = !!srcu_get_delay(ssp);
+ if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+ cbdelay = 0;
+
WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
rcu_seq_end(&ssp->srcu_gp_seq);
gpseq = rcu_seq_current(&ssp->srcu_gp_seq);
@@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
*/
static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
{
+ unsigned long curdelay;
+
+ curdelay = !srcu_get_delay(ssp);
+
for (;;) {
if (srcu_readers_active_idx_check(ssp, idx))
return true;
- if (--trycount + !srcu_get_delay(ssp) <= 0)
+ if ((--trycount + curdelay) <= 0)
return false;
- udelay(SRCU_RETRY_CHECK_DELAY);
+ udelay(srcu_retry_check_delay);
}
}
@@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work)
j = jiffies;
if (READ_ONCE(ssp->reschedule_jiffies) == j) {
WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1);
- if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY)
+ if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay)
curdelay = 1;
} else {
WRITE_ONCE(ssp->reschedule_count, 1);
@@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void)
pr_info("Hierarchical SRCU implementation.\n");
if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF)
pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff);
+ if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY)
+ pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay);
+ if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY)
+ pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay);
+ pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase);
return 0;
}
early_initcall(srcu_bootup_announce);
--
2.17.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-27 12:37 [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further Neeraj Upadhyay @ 2022-06-27 22:45 ` Paul E. McKenney 2022-06-28 3:15 ` Neeraj Upadhyay 2022-06-28 2:14 ` Zhangfei Gao 2022-06-28 9:02 ` Marc Zyngier 2 siblings, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2022-06-27 22:45 UTC (permalink / raw) To: Neeraj Upadhyay Cc: frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On Mon, Jun 27, 2022 at 06:07:06PM +0530, Neeraj Upadhyay wrote: > Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > grace periods") highlights a problem where aggressively blocking > SRCU expedited grace periods, as was introduced in commit > 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > from consuming CPU"), introduces ~2 minutes delay to the overall > ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > cmdline on qemu, which results in very high rate of memslots > add/remove, which causes > ~6000 synchronize_srcu() calls for > kvm->srcu SRCU instance. > > Below table captures the experiments done by Zhangfei Gao, Shameer, Zhangfei Gao and Shameerali Kolothum without commas? > to measure the boottime impact with various values of non-sleeping > per phase counts, with HZ_250 and preemption enabled: > > +──────────────────────────+────────────────+ > | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > +──────────────────────────+────────────────+ > | 100 | 30.053 | > | 150 | 25.151 | > | 200 | 20.704 | > | 250 | 15.748 | > | 500 | 11.401 | > | 1000 | 11.443 | > | 10000 | 11.258 | > | 1000000 | 11.154 | > +──────────────────────────+────────────────+ > > Analysis on the experiment results showed improved boot time > with non blocking delays close to one jiffy duration. This > was also seen when number of per-phase iterations were scaled > to one jiffy. > > So, this change scales per-grace-period phase number of non-sleeping > polls, soiuch that, non-sleeping polls are done for one jiffy. In addition such that? > to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > the delay used for scheduling callbacks, is replaced with the check for > expedited grace period. This is done, to schedule cbs for completed expedited > grace periods immediately, which results in improved boot time seen in > experiments. > > In addition to the changes to default per phase delays, this change > adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > This allows users to configure the srcu grace period scanning delays, > depending on their system configuration requirements. > > Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> I have queued this on an experimental branch quic_neeraju.2022.06.27a for testing purposes. One question below. > --- > .../admin-guide/kernel-parameters.txt | 18 +++++ > kernel/rcu/srcutree.c | 79 ++++++++++++++----- > 2 files changed, 78 insertions(+), 19 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index af647714c113..7e34086c64f5 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5805,6 +5805,24 @@ > expediting. Set to zero to disable automatic > expediting. > > + srcutree.srcu_max_nodelay [KNL] > + Specifies the number of no-delay instances > + per jiffy for which the SRCU grace period > + worker thread will be rescheduled with zero > + delay. Beyond this limit, worker thread will > + be rescheduled with a sleep delay of one jiffy. > + > + srcutree.srcu_max_nodelay_phase [KNL] > + Specifies the per-grace-period phase, number of > + non-sleeping polls of readers. Beyond this limit, > + grace period worker thread will be rescheduled > + with a sleep delay of one jiffy, between each > + rescan of the readers, for a grace period phase. > + > + srcutree.srcu_retry_check_delay [KNL] > + Specifies number of microseconds of non-sleeping > + delay between each non-sleeping poll of readers. > + > srcutree.small_contention_lim [KNL] > Specifies the number of update-side contention > events per jiffy will be tolerated before > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 0db7873f4e95..006828b9c41a 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > return sum; > } > > -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > +/* > + * We use an adaptive strategy for synchronize_srcu() and especially for > + * synchronize_srcu_expedited(). We spin for a fixed time period > + * (defined below, boot time configurable) to allow SRCU readers to exit > + * their read-side critical sections. If there are still some readers > + * after one jiffy, we repeatedly block for one jiffy time periods. > + * The blocking time is increased as the grace-period age increases, > + * with max blocking time capped at 10 jiffies. > + */ > +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 > + > +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; > +module_param(srcu_retry_check_delay, ulong, 0444); > + > +#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > + > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on default per-GP-phase > + // no-delay instances. > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on default per-GP-phase > + // no-delay instances. > + > +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) > +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high)) Can these just be clamp()? Or does its type checking get in the way? Thanx, Paul > +// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto > +// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay() > +// called from process_srcu(). > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ > + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) > + > +// Maximum per-GP-phase consecutive no-delay instances. > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ > + SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, \ > + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ > + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) > + > +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; > +module_param(srcu_max_nodelay_phase, ulong, 0444); > + > +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > + > +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; > +module_param(srcu_max_nodelay, ulong, 0444); > > /* > * Return grace-period delay, zero if there are expedited grace > @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp) > jbase += j - gpstart; > if (!jbase) { > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase) > jbase = 1; > } > } > @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) > } > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > -/* > - * We use an adaptive strategy for synchronize_srcu() and especially for > - * synchronize_srcu_expedited(). We spin for a fixed time period > - * (defined below) to allow SRCU readers to exit their read-side critical > - * sections. If there are still some readers after a few microseconds, > - * we repeatedly block for 1-millisecond time periods. > - */ > -#define SRCU_RETRY_CHECK_DELAY 5 > - > /* > * Start an SRCU grace period. > */ > @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp > */ > static void srcu_gp_end(struct srcu_struct *ssp) > { > - unsigned long cbdelay; > + unsigned long cbdelay = 1; > bool cbs; > bool last_lvl; > int cpu; > @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > spin_lock_irq_rcu_node(ssp); > idx = rcu_seq_state(ssp->srcu_gp_seq); > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > - cbdelay = !!srcu_get_delay(ssp); > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > + cbdelay = 0; > + > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > rcu_seq_end(&ssp->srcu_gp_seq); > gpseq = rcu_seq_current(&ssp->srcu_gp_seq); > @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > */ > static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) > { > + unsigned long curdelay; > + > + curdelay = !srcu_get_delay(ssp); > + > for (;;) { > if (srcu_readers_active_idx_check(ssp, idx)) > return true; > - if (--trycount + !srcu_get_delay(ssp) <= 0) > + if ((--trycount + curdelay) <= 0) > return false; > - udelay(SRCU_RETRY_CHECK_DELAY); > + udelay(srcu_retry_check_delay); > } > } > > @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work) > j = jiffies; > if (READ_ONCE(ssp->reschedule_jiffies) == j) { > WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1); > - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) > + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) > curdelay = 1; > } else { > WRITE_ONCE(ssp->reschedule_count, 1); > @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) > pr_info("Hierarchical SRCU implementation.\n"); > if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) > pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff); > + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) > + pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay); > + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) > + pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay); > + pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase); > return 0; > } > early_initcall(srcu_bootup_announce); > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-27 22:45 ` Paul E. McKenney @ 2022-06-28 3:15 ` Neeraj Upadhyay 2022-06-28 4:26 ` Paul E. McKenney 0 siblings, 1 reply; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 3:15 UTC (permalink / raw) To: paulmck Cc: frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 6/28/2022 4:15 AM, Paul E. McKenney wrote: > On Mon, Jun 27, 2022 at 06:07:06PM +0530, Neeraj Upadhyay wrote: >> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >> grace periods") highlights a problem where aggressively blocking >> SRCU expedited grace periods, as was introduced in commit >> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >> from consuming CPU"), introduces ~2 minutes delay to the overall >> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >> cmdline on qemu, which results in very high rate of memslots >> add/remove, which causes > ~6000 synchronize_srcu() calls for >> kvm->srcu SRCU instance. >> >> Below table captures the experiments done by Zhangfei Gao, Shameer, > > Zhangfei Gao and Shameerali Kolothum without commas? > Will fix in next version >> to measure the boottime impact with various values of non-sleeping >> per phase counts, with HZ_250 and preemption enabled: >> >> +──────────────────────────+────────────────+ >> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >> +──────────────────────────+────────────────+ >> | 100 | 30.053 | >> | 150 | 25.151 | >> | 200 | 20.704 | >> | 250 | 15.748 | >> | 500 | 11.401 | >> | 1000 | 11.443 | >> | 10000 | 11.258 | >> | 1000000 | 11.154 | >> +──────────────────────────+────────────────+ >> >> Analysis on the experiment results showed improved boot time >> with non blocking delays close to one jiffy duration. This >> was also seen when number of per-phase iterations were scaled >> to one jiffy. >> >> So, this change scales per-grace-period phase number of non-sleeping >> polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > > such that? > oops, will fix. >> to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate >> the delay used for scheduling callbacks, is replaced with the check for >> expedited grace period. This is done, to schedule cbs for completed expedited >> grace periods immediately, which results in improved boot time seen in >> experiments. >> >> In addition to the changes to default per phase delays, this change >> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >> This allows users to configure the srcu grace period scanning delays, >> depending on their system configuration requirements. >> >> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > I have queued this on an experimental branch quic_neeraju.2022.06.27a > for testing purposes. One question below. Thanks! > >> --- >> .../admin-guide/kernel-parameters.txt | 18 +++++ >> kernel/rcu/srcutree.c | 79 ++++++++++++++----- >> 2 files changed, 78 insertions(+), 19 deletions(-) >> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt >> index af647714c113..7e34086c64f5 100644 >> --- a/Documentation/admin-guide/kernel-parameters.txt >> +++ b/Documentation/admin-guide/kernel-parameters.txt >> @@ -5805,6 +5805,24 @@ >> expediting. Set to zero to disable automatic >> expediting. >> >> + srcutree.srcu_max_nodelay [KNL] >> + Specifies the number of no-delay instances >> + per jiffy for which the SRCU grace period >> + worker thread will be rescheduled with zero >> + delay. Beyond this limit, worker thread will >> + be rescheduled with a sleep delay of one jiffy. >> + >> + srcutree.srcu_max_nodelay_phase [KNL] >> + Specifies the per-grace-period phase, number of >> + non-sleeping polls of readers. Beyond this limit, >> + grace period worker thread will be rescheduled >> + with a sleep delay of one jiffy, between each >> + rescan of the readers, for a grace period phase. >> + >> + srcutree.srcu_retry_check_delay [KNL] >> + Specifies number of microseconds of non-sleeping >> + delay between each non-sleeping poll of readers. >> + >> srcutree.small_contention_lim [KNL] >> Specifies the number of update-side contention >> events per jiffy will be tolerated before >> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >> index 0db7873f4e95..006828b9c41a 100644 >> --- a/kernel/rcu/srcutree.c >> +++ b/kernel/rcu/srcutree.c >> @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct srcu_struct *ssp) >> return sum; >> } >> >> -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. >> -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. >> -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. >> -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. >> +/* >> + * We use an adaptive strategy for synchronize_srcu() and especially for >> + * synchronize_srcu_expedited(). We spin for a fixed time period >> + * (defined below, boot time configurable) to allow SRCU readers to exit >> + * their read-side critical sections. If there are still some readers >> + * after one jiffy, we repeatedly block for one jiffy time periods. >> + * The blocking time is increased as the grace-period age increases, >> + * with max blocking time capped at 10 jiffies. >> + */ >> +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 >> + >> +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; >> +module_param(srcu_retry_check_delay, ulong, 0444); >> + >> +#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. >> +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. >> + >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on default per-GP-phase >> + // no-delay instances. >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on default per-GP-phase >> + // no-delay instances. >> + >> +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) >> +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high)) > > Can these just be clamp()? Or does its type checking get in the way? > When I use clamp() or clamp_t(), I get warning like below: CC kernel/rcu/srcutree.o In file included from ./include/linux/kernel.h:26:0, from ./arch/x86/include/asm/percpu.h:27, from ./arch/x86/include/asm/current.h:6, from ./include/linux/mutex.h:14, from kernel/rcu/srcutree.c:19: ./include/linux/minmax.h:30:50: error: braced-group within expression allowed only inside a function #define __cmp_once(x, y, unique_x, unique_y, op) ({ \ ^ ./include/linux/minmax.h:20:21: note: in definition of macro ‘__typecheck’ (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) Which seems to be coming from braced usage in __cmp_once() #define __cmp_once(x, y, unique_x, unique_y, op) ({ \ typeof(x) unique_x = (x); \ typeof(y) unique_y = (y); \ __cmp(unique_x, unique_y, op); }) ... as I am using clamp() to initialize module param , which is outside of a function. static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; Thanks Neeraj > Thanx, Paul > >> +// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto >> +// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay() >> +// called from process_srcu(). >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ >> + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) >> + >> +// Maximum per-GP-phase consecutive no-delay instances. >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ >> + SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, \ >> + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ >> + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) >> + >> +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; >> +module_param(srcu_max_nodelay_phase, ulong, 0444); >> + >> +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive no-delay instances. >> + >> +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; >> +module_param(srcu_max_nodelay, ulong, 0444); >> >> /* >> * Return grace-period delay, zero if there are expedited grace >> @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp) >> jbase += j - gpstart; >> if (!jbase) { >> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) >> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase) >> jbase = 1; >> } >> } >> @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) >> } >> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >> >> -/* >> - * We use an adaptive strategy for synchronize_srcu() and especially for >> - * synchronize_srcu_expedited(). We spin for a fixed time period >> - * (defined below) to allow SRCU readers to exit their read-side critical >> - * sections. If there are still some readers after a few microseconds, >> - * we repeatedly block for 1-millisecond time periods. >> - */ >> -#define SRCU_RETRY_CHECK_DELAY 5 >> - >> /* >> * Start an SRCU grace period. >> */ >> @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp >> */ >> static void srcu_gp_end(struct srcu_struct *ssp) >> { >> - unsigned long cbdelay; >> + unsigned long cbdelay = 1; >> bool cbs; >> bool last_lvl; >> int cpu; >> @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >> spin_lock_irq_rcu_node(ssp); >> idx = rcu_seq_state(ssp->srcu_gp_seq); >> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >> - cbdelay = !!srcu_get_delay(ssp); >> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >> + cbdelay = 0; >> + >> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >> rcu_seq_end(&ssp->srcu_gp_seq); >> gpseq = rcu_seq_current(&ssp->srcu_gp_seq); >> @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, >> */ >> static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) >> { >> + unsigned long curdelay; >> + >> + curdelay = !srcu_get_delay(ssp); >> + >> for (;;) { >> if (srcu_readers_active_idx_check(ssp, idx)) >> return true; >> - if (--trycount + !srcu_get_delay(ssp) <= 0) >> + if ((--trycount + curdelay) <= 0) >> return false; >> - udelay(SRCU_RETRY_CHECK_DELAY); >> + udelay(srcu_retry_check_delay); >> } >> } >> >> @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work) >> j = jiffies; >> if (READ_ONCE(ssp->reschedule_jiffies) == j) { >> WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1); >> - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) >> + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) >> curdelay = 1; >> } else { >> WRITE_ONCE(ssp->reschedule_count, 1); >> @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) >> pr_info("Hierarchical SRCU implementation.\n"); >> if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) >> pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff); >> + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) >> + pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay); >> + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) >> + pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay); >> + pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase); >> return 0; >> } >> early_initcall(srcu_bootup_announce); >> -- >> 2.17.1 >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 3:15 ` Neeraj Upadhyay @ 2022-06-28 4:26 ` Paul E. McKenney 0 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2022-06-28 4:26 UTC (permalink / raw) To: Neeraj Upadhyay Cc: frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On Tue, Jun 28, 2022 at 08:45:50AM +0530, Neeraj Upadhyay wrote: > > > On 6/28/2022 4:15 AM, Paul E. McKenney wrote: > > On Mon, Jun 27, 2022 at 06:07:06PM +0530, Neeraj Upadhyay wrote: > > > Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > > > grace periods") highlights a problem where aggressively blocking > > > SRCU expedited grace periods, as was introduced in commit > > > 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > > > from consuming CPU"), introduces ~2 minutes delay to the overall > > > ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > > > cmdline on qemu, which results in very high rate of memslots > > > add/remove, which causes > ~6000 synchronize_srcu() calls for > > > kvm->srcu SRCU instance. > > > > > > Below table captures the experiments done by Zhangfei Gao, Shameer, > > > > Zhangfei Gao and Shameerali Kolothum without commas? > > > > Will fix in next version > > > > to measure the boottime impact with various values of non-sleeping > > > per phase counts, with HZ_250 and preemption enabled: > > > > > > +──────────────────────────+────────────────+ > > > | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > > > +──────────────────────────+────────────────+ > > > | 100 | 30.053 | > > > | 150 | 25.151 | > > > | 200 | 20.704 | > > > | 250 | 15.748 | > > > | 500 | 11.401 | > > > | 1000 | 11.443 | > > > | 10000 | 11.258 | > > > | 1000000 | 11.154 | > > > +──────────────────────────+────────────────+ > > > > > > Analysis on the experiment results showed improved boot time > > > with non blocking delays close to one jiffy duration. This > > > was also seen when number of per-phase iterations were scaled > > > to one jiffy. > > > > > > So, this change scales per-grace-period phase number of non-sleeping > > > polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > > > > such that? > > > > oops, will fix. > > > > to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > > > the delay used for scheduling callbacks, is replaced with the check for > > > expedited grace period. This is done, to schedule cbs for completed expedited > > > grace periods immediately, which results in improved boot time seen in > > > experiments. > > > > > > In addition to the changes to default per phase delays, this change > > > adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > > > srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > > > This allows users to configure the srcu grace period scanning delays, > > > depending on their system configuration requirements. > > > > > > Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > > > I have queued this on an experimental branch quic_neeraju.2022.06.27a > > for testing purposes. One question below. > > Thanks! > > > > > > --- > > > .../admin-guide/kernel-parameters.txt | 18 +++++ > > > kernel/rcu/srcutree.c | 79 ++++++++++++++----- > > > 2 files changed, 78 insertions(+), 19 deletions(-) > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > index af647714c113..7e34086c64f5 100644 > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > @@ -5805,6 +5805,24 @@ > > > expediting. Set to zero to disable automatic > > > expediting. > > > + srcutree.srcu_max_nodelay [KNL] > > > + Specifies the number of no-delay instances > > > + per jiffy for which the SRCU grace period > > > + worker thread will be rescheduled with zero > > > + delay. Beyond this limit, worker thread will > > > + be rescheduled with a sleep delay of one jiffy. > > > + > > > + srcutree.srcu_max_nodelay_phase [KNL] > > > + Specifies the per-grace-period phase, number of > > > + non-sleeping polls of readers. Beyond this limit, > > > + grace period worker thread will be rescheduled > > > + with a sleep delay of one jiffy, between each > > > + rescan of the readers, for a grace period phase. > > > + > > > + srcutree.srcu_retry_check_delay [KNL] > > > + Specifies number of microseconds of non-sleeping > > > + delay between each non-sleeping poll of readers. > > > + > > > srcutree.small_contention_lim [KNL] > > > Specifies the number of update-side contention > > > events per jiffy will be tolerated before > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 0db7873f4e95..006828b9c41a 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > > > return sum; > > > } > > > -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > > -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > > -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > > > -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > > > +/* > > > + * We use an adaptive strategy for synchronize_srcu() and especially for > > > + * synchronize_srcu_expedited(). We spin for a fixed time period > > > + * (defined below, boot time configurable) to allow SRCU readers to exit > > > + * their read-side critical sections. If there are still some readers > > > + * after one jiffy, we repeatedly block for one jiffy time periods. > > > + * The blocking time is increased as the grace-period age increases, > > > + * with max blocking time capped at 10 jiffies. > > > + */ > > > +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 > > > + > > > +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; > > > +module_param(srcu_retry_check_delay, ulong, 0444); > > > + > > > +#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > > > +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > > > + > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on default per-GP-phase > > > + // no-delay instances. > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on default per-GP-phase > > > + // no-delay instances. > > > + > > > +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) > > > +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high)) > > > > Can these just be clamp()? Or does its type checking get in the way? > > > > When I use clamp() or clamp_t(), I get warning like below: > > > CC kernel/rcu/srcutree.o > In file included from ./include/linux/kernel.h:26:0, > from ./arch/x86/include/asm/percpu.h:27, > from ./arch/x86/include/asm/current.h:6, > from ./include/linux/mutex.h:14, > from kernel/rcu/srcutree.c:19: > ./include/linux/minmax.h:30:50: error: braced-group within expression > allowed only inside a function > #define __cmp_once(x, y, unique_x, unique_y, op) ({ \ > ^ > ./include/linux/minmax.h:20:21: note: in definition of macro ‘__typecheck’ > (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) > > > Which seems to be coming from braced usage in __cmp_once() > #define __cmp_once(x, y, unique_x, unique_y, op) ({ \ > typeof(x) unique_x = (x); \ > typeof(y) unique_y = (y); \ > __cmp(unique_x, unique_y, op); }) > > ... as I am using clamp() to initialize module param , which is outside of a > function. > > static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; OK, so the type-checking is indeed causing you trouble. Hey, I had to ask! Thanx, Paul > Thanks > Neeraj > > > Thanx, Paul > > > > > +// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto > > > +// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay() > > > +// called from process_srcu(). > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ > > > + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) > > > + > > > +// Maximum per-GP-phase consecutive no-delay instances. > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ > > > + SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, \ > > > + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ > > > + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) > > > + > > > +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; > > > +module_param(srcu_max_nodelay_phase, ulong, 0444); > > > + > > > +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > > > + > > > +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; > > > +module_param(srcu_max_nodelay, ulong, 0444); > > > /* > > > * Return grace-period delay, zero if there are expedited grace > > > @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp) > > > jbase += j - gpstart; > > > if (!jbase) { > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > > > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase) > > > jbase = 1; > > > } > > > } > > > @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) > > > } > > > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > > -/* > > > - * We use an adaptive strategy for synchronize_srcu() and especially for > > > - * synchronize_srcu_expedited(). We spin for a fixed time period > > > - * (defined below) to allow SRCU readers to exit their read-side critical > > > - * sections. If there are still some readers after a few microseconds, > > > - * we repeatedly block for 1-millisecond time periods. > > > - */ > > > -#define SRCU_RETRY_CHECK_DELAY 5 > > > - > > > /* > > > * Start an SRCU grace period. > > > */ > > > @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp > > > */ > > > static void srcu_gp_end(struct srcu_struct *ssp) > > > { > > > - unsigned long cbdelay; > > > + unsigned long cbdelay = 1; > > > bool cbs; > > > bool last_lvl; > > > int cpu; > > > @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > > > spin_lock_irq_rcu_node(ssp); > > > idx = rcu_seq_state(ssp->srcu_gp_seq); > > > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > > > - cbdelay = !!srcu_get_delay(ssp); > > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > + cbdelay = 0; > > > + > > > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > > > rcu_seq_end(&ssp->srcu_gp_seq); > > > gpseq = rcu_seq_current(&ssp->srcu_gp_seq); > > > @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > */ > > > static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) > > > { > > > + unsigned long curdelay; > > > + > > > + curdelay = !srcu_get_delay(ssp); > > > + > > > for (;;) { > > > if (srcu_readers_active_idx_check(ssp, idx)) > > > return true; > > > - if (--trycount + !srcu_get_delay(ssp) <= 0) > > > + if ((--trycount + curdelay) <= 0) > > > return false; > > > - udelay(SRCU_RETRY_CHECK_DELAY); > > > + udelay(srcu_retry_check_delay); > > > } > > > } > > > @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work) > > > j = jiffies; > > > if (READ_ONCE(ssp->reschedule_jiffies) == j) { > > > WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1); > > > - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) > > > + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) > > > curdelay = 1; > > > } else { > > > WRITE_ONCE(ssp->reschedule_count, 1); > > > @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) > > > pr_info("Hierarchical SRCU implementation.\n"); > > > if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) > > > pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff); > > > + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) > > > + pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay); > > > + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) > > > + pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay); > > > + pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase); > > > return 0; > > > } > > > early_initcall(srcu_bootup_announce); > > > -- > > > 2.17.1 > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-27 12:37 [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further Neeraj Upadhyay 2022-06-27 22:45 ` Paul E. McKenney @ 2022-06-28 2:14 ` Zhangfei Gao 2022-06-28 3:22 ` Neeraj Upadhyay 2022-06-28 9:02 ` Marc Zyngier 2 siblings, 1 reply; 15+ messages in thread From: Zhangfei Gao @ 2022-06-28 2:14 UTC (permalink / raw) To: Neeraj Upadhyay, paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel Cc: linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: > Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > grace periods") highlights a problem where aggressively blocking > SRCU expedited grace periods, as was introduced in commit > 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > from consuming CPU"), introduces ~2 minutes delay to the overall > ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > cmdline on qemu, which results in very high rate of memslots > add/remove, which causes > ~6000 synchronize_srcu() calls for > kvm->srcu SRCU instance. > > Below table captures the experiments done by Zhangfei Gao, Shameer, > to measure the boottime impact with various values of non-sleeping > per phase counts, with HZ_250 and preemption enabled: > > +──────────────────────────+────────────────+ > | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > +──────────────────────────+────────────────+ > | 100 | 30.053 | > | 150 | 25.151 | > | 200 | 20.704 | > | 250 | 15.748 | > | 500 | 11.401 | > | 1000 | 11.443 | > | 10000 | 11.258 | > | 1000000 | 11.154 | > +──────────────────────────+────────────────+ > > Analysis on the experiment results showed improved boot time > with non blocking delays close to one jiffy duration. This > was also seen when number of per-phase iterations were scaled > to one jiffy. > > So, this change scales per-grace-period phase number of non-sleeping > polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > the delay used for scheduling callbacks, is replaced with the check for > expedited grace period. This is done, to schedule cbs for completed expedited > grace periods immediately, which results in improved boot time seen in > experiments. > > In addition to the changes to default per phase delays, this change > adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > This allows users to configure the srcu grace period scanning delays, > depending on their system configuration requirements. > > Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Test on arm64 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev 5.19-rc3 arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y CONFIG_HZ_250=y real 0m11.498s user 0m2.911s sys 0m1.171s As comparison (Since can not directly revert on linux-rcu.git dev, so use rc1 instead) 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers from consuming CPU" real 0m8.173s user 0m3.024s sys 0m0.959s 5.19-rc1 real 2m41.433s user 0m3.097s sys 0m1.177s Thanks > --- > .../admin-guide/kernel-parameters.txt | 18 +++++ > kernel/rcu/srcutree.c | 79 ++++++++++++++----- > 2 files changed, 78 insertions(+), 19 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index af647714c113..7e34086c64f5 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5805,6 +5805,24 @@ > expediting. Set to zero to disable automatic > expediting. > > + srcutree.srcu_max_nodelay [KNL] > + Specifies the number of no-delay instances > + per jiffy for which the SRCU grace period > + worker thread will be rescheduled with zero > + delay. Beyond this limit, worker thread will > + be rescheduled with a sleep delay of one jiffy. > + > + srcutree.srcu_max_nodelay_phase [KNL] > + Specifies the per-grace-period phase, number of > + non-sleeping polls of readers. Beyond this limit, > + grace period worker thread will be rescheduled > + with a sleep delay of one jiffy, between each > + rescan of the readers, for a grace period phase. > + > + srcutree.srcu_retry_check_delay [KNL] > + Specifies number of microseconds of non-sleeping > + delay between each non-sleeping poll of readers. > + > srcutree.small_contention_lim [KNL] > Specifies the number of update-side contention > events per jiffy will be tolerated before > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 0db7873f4e95..006828b9c41a 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct srcu_struct *ssp) > return sum; > } > > -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. > -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > +/* > + * We use an adaptive strategy for synchronize_srcu() and especially for > + * synchronize_srcu_expedited(). We spin for a fixed time period > + * (defined below, boot time configurable) to allow SRCU readers to exit > + * their read-side critical sections. If there are still some readers > + * after one jiffy, we repeatedly block for one jiffy time periods. > + * The blocking time is increased as the grace-period age increases, > + * with max blocking time capped at 10 jiffies. > + */ > +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 > + > +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; > +module_param(srcu_retry_check_delay, ulong, 0444); > + > +#define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. > +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. > + > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on default per-GP-phase > + // no-delay instances. > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on default per-GP-phase > + // no-delay instances. > + > +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) > +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high)) > +// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto > +// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay() > +// called from process_srcu(). > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ > + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) > + > +// Maximum per-GP-phase consecutive no-delay instances. > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ > + SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, \ > + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ > + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) > + > +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; > +module_param(srcu_max_nodelay_phase, ulong, 0444); > + > +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive no-delay instances. > + > +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; > +module_param(srcu_max_nodelay, ulong, 0444); > > /* > * Return grace-period delay, zero if there are expedited grace > @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp) > jbase += j - gpstart; > if (!jbase) { > WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase) > jbase = 1; > } > } > @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) > } > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > -/* > - * We use an adaptive strategy for synchronize_srcu() and especially for > - * synchronize_srcu_expedited(). We spin for a fixed time period > - * (defined below) to allow SRCU readers to exit their read-side critical > - * sections. If there are still some readers after a few microseconds, > - * we repeatedly block for 1-millisecond time periods. > - */ > -#define SRCU_RETRY_CHECK_DELAY 5 > - > /* > * Start an SRCU grace period. > */ > @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp > */ > static void srcu_gp_end(struct srcu_struct *ssp) > { > - unsigned long cbdelay; > + unsigned long cbdelay = 1; > bool cbs; > bool last_lvl; > int cpu; > @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > spin_lock_irq_rcu_node(ssp); > idx = rcu_seq_state(ssp->srcu_gp_seq); > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > - cbdelay = !!srcu_get_delay(ssp); > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > + cbdelay = 0; > + > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > rcu_seq_end(&ssp->srcu_gp_seq); > gpseq = rcu_seq_current(&ssp->srcu_gp_seq); > @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > */ > static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) > { > + unsigned long curdelay; > + > + curdelay = !srcu_get_delay(ssp); > + > for (;;) { > if (srcu_readers_active_idx_check(ssp, idx)) > return true; > - if (--trycount + !srcu_get_delay(ssp) <= 0) > + if ((--trycount + curdelay) <= 0) > return false; > - udelay(SRCU_RETRY_CHECK_DELAY); > + udelay(srcu_retry_check_delay); > } > } > > @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work) > j = jiffies; > if (READ_ONCE(ssp->reschedule_jiffies) == j) { > WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1); > - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) > + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) > curdelay = 1; > } else { > WRITE_ONCE(ssp->reschedule_count, 1); > @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) > pr_info("Hierarchical SRCU implementation.\n"); > if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) > pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff); > + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) > + pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay); > + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) > + pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay); > + pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase); > return 0; > } > early_initcall(srcu_bootup_announce); ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 2:14 ` Zhangfei Gao @ 2022-06-28 3:22 ` Neeraj Upadhyay 2022-06-28 4:03 ` Zhangfei Gao 0 siblings, 1 reply; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 3:22 UTC (permalink / raw) To: Zhangfei Gao, paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel Cc: linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 6/28/2022 7:44 AM, Zhangfei Gao wrote: > > > On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: >> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >> grace periods") highlights a problem where aggressively blocking >> SRCU expedited grace periods, as was introduced in commit >> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >> from consuming CPU"), introduces ~2 minutes delay to the overall >> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >> cmdline on qemu, which results in very high rate of memslots >> add/remove, which causes > ~6000 synchronize_srcu() calls for >> kvm->srcu SRCU instance. >> >> Below table captures the experiments done by Zhangfei Gao, Shameer, >> to measure the boottime impact with various values of non-sleeping >> per phase counts, with HZ_250 and preemption enabled: >> >> +──────────────────────────+────────────────+ >> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >> +──────────────────────────+────────────────+ >> | 100 | 30.053 | >> | 150 | 25.151 | >> | 200 | 20.704 | >> | 250 | 15.748 | >> | 500 | 11.401 | >> | 1000 | 11.443 | >> | 10000 | 11.258 | >> | 1000000 | 11.154 | >> +──────────────────────────+────────────────+ >> >> Analysis on the experiment results showed improved boot time >> with non blocking delays close to one jiffy duration. This >> was also seen when number of per-phase iterations were scaled >> to one jiffy. >> >> So, this change scales per-grace-period phase number of non-sleeping >> polls, soiuch that, non-sleeping polls are done for one jiffy. In >> addition >> to this, srcu_get_delay() call in srcu_gp_end(), which is used to >> calculate >> the delay used for scheduling callbacks, is replaced with the check for >> expedited grace period. This is done, to schedule cbs for completed >> expedited >> grace periods immediately, which results in improved boot time seen in >> experiments. >> >> In addition to the changes to default per phase delays, this change >> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >> This allows users to configure the srcu grace period scanning delays, >> depending on their system configuration requirements. >> >> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > Test on arm64 > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev > 5.19-rc3 > > arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y > CONFIG_HZ_250=y > If it is possible to try out, do you get similar results with HZ_1000? > real 0m11.498s > user 0m2.911s > sys 0m1.171s > > > As comparison (Since can not directly revert on linux-rcu.git dev, so > use rc1 instead) > > 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers from > consuming CPU" > > real 0m8.173s > user 0m3.024s > sys 0m0.959s > > 5.19-rc1 > real 2m41.433s > user 0m3.097s > sys 0m1.177s > Thanks! Can I add your Tested-by in subsequent versions of the patch? The numbers are aligned to the initial experiments, without using long retry delays of 100 us. Using long delays might have impact on other workloads, which could be sensitive to the delay between retries. So, I didn't include that in the patch. Thanks Neeraj > Thanks > >> --- >> .../admin-guide/kernel-parameters.txt | 18 +++++ >> kernel/rcu/srcutree.c | 79 ++++++++++++++----- >> 2 files changed, 78 insertions(+), 19 deletions(-) >> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt >> b/Documentation/admin-guide/kernel-parameters.txt >> index af647714c113..7e34086c64f5 100644 >> --- a/Documentation/admin-guide/kernel-parameters.txt >> +++ b/Documentation/admin-guide/kernel-parameters.txt >> @@ -5805,6 +5805,24 @@ >> expediting. Set to zero to disable automatic >> expediting. >> + srcutree.srcu_max_nodelay [KNL] >> + Specifies the number of no-delay instances >> + per jiffy for which the SRCU grace period >> + worker thread will be rescheduled with zero >> + delay. Beyond this limit, worker thread will >> + be rescheduled with a sleep delay of one jiffy. >> + >> + srcutree.srcu_max_nodelay_phase [KNL] >> + Specifies the per-grace-period phase, number of >> + non-sleeping polls of readers. Beyond this limit, >> + grace period worker thread will be rescheduled >> + with a sleep delay of one jiffy, between each >> + rescan of the readers, for a grace period phase. >> + >> + srcutree.srcu_retry_check_delay [KNL] >> + Specifies number of microseconds of non-sleeping >> + delay between each non-sleeping poll of readers. >> + >> srcutree.small_contention_lim [KNL] >> Specifies the number of update-side contention >> events per jiffy will be tolerated before >> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >> index 0db7873f4e95..006828b9c41a 100644 >> --- a/kernel/rcu/srcutree.c >> +++ b/kernel/rcu/srcutree.c >> @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct >> srcu_struct *ssp) >> return sum; >> } >> -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs >> pending. >> -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from >> slow readers. >> -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase >> consecutive no-delay instances. >> -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay >> instances. >> +/* >> + * We use an adaptive strategy for synchronize_srcu() and especially for >> + * synchronize_srcu_expedited(). We spin for a fixed time period >> + * (defined below, boot time configurable) to allow SRCU readers to exit >> + * their read-side critical sections. If there are still some readers >> + * after one jiffy, we repeatedly block for one jiffy time periods. >> + * The blocking time is increased as the grace-period age increases, >> + * with max blocking time capped at 10 jiffies. >> + */ >> +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 >> + >> +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; >> +module_param(srcu_retry_check_delay, ulong, 0444); >> + >> +#define SRCU_INTERVAL 1 // Base delay if no expedited >> GPs pending. >> +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >> from slow readers. >> + >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on >> default per-GP-phase >> + // no-delay instances. >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark on >> default per-GP-phase >> + // no-delay instances. >> + >> +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) >> +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : (high)) >> +// per-GP-phase no-delay instances adjusted to allow non-sleeping >> poll upto >> +// one jiffies time duration. Mult by 2 is done to factor in the >> srcu_get_delay() >> +// called from process_srcu(). >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ >> + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) >> + >> +// Maximum per-GP-phase consecutive no-delay instances. >> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ >> + >> SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, >> \ >> + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ >> + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) >> + >> +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; >> +module_param(srcu_max_nodelay_phase, ulong, 0444); >> + >> +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive >> no-delay instances. >> + >> +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; >> +module_param(srcu_max_nodelay, ulong, 0444); >> /* >> * Return grace-period delay, zero if there are expedited grace >> @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct >> srcu_struct *ssp) >> jbase += j - gpstart; >> if (!jbase) { >> WRITE_ONCE(ssp->srcu_n_exp_nodelay, >> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >> SRCU_MAX_NODELAY_PHASE) >> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >> srcu_max_nodelay_phase) >> jbase = 1; >> } >> } >> @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, >> int idx) >> } >> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >> -/* >> - * We use an adaptive strategy for synchronize_srcu() and especially for >> - * synchronize_srcu_expedited(). We spin for a fixed time period >> - * (defined below) to allow SRCU readers to exit their read-side >> critical >> - * sections. If there are still some readers after a few microseconds, >> - * we repeatedly block for 1-millisecond time periods. >> - */ >> -#define SRCU_RETRY_CHECK_DELAY 5 >> - >> /* >> * Start an SRCU grace period. >> */ >> @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct >> srcu_struct *ssp, struct srcu_node *snp >> */ >> static void srcu_gp_end(struct srcu_struct *ssp) >> { >> - unsigned long cbdelay; >> + unsigned long cbdelay = 1; >> bool cbs; >> bool last_lvl; >> int cpu; >> @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >> spin_lock_irq_rcu_node(ssp); >> idx = rcu_seq_state(ssp->srcu_gp_seq); >> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >> - cbdelay = !!srcu_get_delay(ssp); >> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >> + cbdelay = 0; >> + >> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >> rcu_seq_end(&ssp->srcu_gp_seq); >> gpseq = rcu_seq_current(&ssp->srcu_gp_seq); >> @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct >> srcu_struct *ssp, struct srcu_data *sdp, >> */ >> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >> trycount) >> { >> + unsigned long curdelay; >> + >> + curdelay = !srcu_get_delay(ssp); >> + >> for (;;) { >> if (srcu_readers_active_idx_check(ssp, idx)) >> return true; >> - if (--trycount + !srcu_get_delay(ssp) <= 0) >> + if ((--trycount + curdelay) <= 0) >> return false; >> - udelay(SRCU_RETRY_CHECK_DELAY); >> + udelay(srcu_retry_check_delay); >> } >> } >> @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct *work) >> j = jiffies; >> if (READ_ONCE(ssp->reschedule_jiffies) == j) { >> WRITE_ONCE(ssp->reschedule_count, >> READ_ONCE(ssp->reschedule_count) + 1); >> - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) >> + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) >> curdelay = 1; >> } else { >> WRITE_ONCE(ssp->reschedule_count, 1); >> @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) >> pr_info("Hierarchical SRCU implementation.\n"); >> if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) >> pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", >> exp_holdoff); >> + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) >> + pr_info("\tNon-default retry check delay of %lu us.\n", >> srcu_retry_check_delay); >> + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) >> + pr_info("\tNon-default max no-delay of %lu.\n", >> srcu_max_nodelay); >> + pr_info("\tMax phase no-delay instances is %lu.\n", >> srcu_max_nodelay_phase); >> return 0; >> } >> early_initcall(srcu_bootup_announce); > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 3:22 ` Neeraj Upadhyay @ 2022-06-28 4:03 ` Zhangfei Gao 2022-06-28 4:24 ` Paul E. McKenney ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Zhangfei Gao @ 2022-06-28 4:03 UTC (permalink / raw) To: Neeraj Upadhyay, paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel Cc: linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 2022/6/28 上午11:22, Neeraj Upadhyay wrote: > > > On 6/28/2022 7:44 AM, Zhangfei Gao wrote: >> >> >> On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: >>> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >>> grace periods") highlights a problem where aggressively blocking >>> SRCU expedited grace periods, as was introduced in commit >>> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >>> from consuming CPU"), introduces ~2 minutes delay to the overall >>> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >>> cmdline on qemu, which results in very high rate of memslots >>> add/remove, which causes > ~6000 synchronize_srcu() calls for >>> kvm->srcu SRCU instance. >>> >>> Below table captures the experiments done by Zhangfei Gao, Shameer, >>> to measure the boottime impact with various values of non-sleeping >>> per phase counts, with HZ_250 and preemption enabled: >>> >>> +──────────────────────────+────────────────+ >>> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >>> +──────────────────────────+────────────────+ >>> | 100 | 30.053 | >>> | 150 | 25.151 | >>> | 200 | 20.704 | >>> | 250 | 15.748 | >>> | 500 | 11.401 | >>> | 1000 | 11.443 | >>> | 10000 | 11.258 | >>> | 1000000 | 11.154 | >>> +──────────────────────────+────────────────+ >>> >>> Analysis on the experiment results showed improved boot time >>> with non blocking delays close to one jiffy duration. This >>> was also seen when number of per-phase iterations were scaled >>> to one jiffy. >>> >>> So, this change scales per-grace-period phase number of non-sleeping >>> polls, soiuch that, non-sleeping polls are done for one jiffy. In >>> addition >>> to this, srcu_get_delay() call in srcu_gp_end(), which is used to >>> calculate >>> the delay used for scheduling callbacks, is replaced with the check for >>> expedited grace period. This is done, to schedule cbs for completed >>> expedited >>> grace periods immediately, which results in improved boot time seen in >>> experiments. >>> >>> In addition to the changes to default per phase delays, this change >>> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >>> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >>> This allows users to configure the srcu grace period scanning delays, >>> depending on their system configuration requirements. >>> >>> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> >> >> Test on arm64 >> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev >> 5.19-rc3 >> >> arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y >> CONFIG_HZ_250=y >> > > If it is possible to try out, do you get similar results with HZ_1000? CONFIG_HZ_1000=y CONFIG_HZ=1000 with this patch real 0m10.560s user 0m3.230s sys 0m1.024s Revert this patch real 0m44.014s user 0m3.005s sys 0m1.287s > >> real 0m11.498s >> user 0m2.911s >> sys 0m1.171s >> >> >> As comparison (Since can not directly revert on linux-rcu.git dev, so >> use rc1 instead) >> >> 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers >> from consuming CPU" >> >> real 0m8.173s >> user 0m3.024s >> sys 0m0.959s >> >> 5.19-rc1 >> real 2m41.433s >> user 0m3.097s >> sys 0m1.177s >> > > Thanks! Can I add your Tested-by in subsequent versions of the patch? Curiously, is this treated as fixed? 11.498s vs. v5.18 8.173s Thanks > > The numbers are aligned to the initial experiments, without using long > retry delays of 100 us. Using long delays might have impact on other > workloads, which could be sensitive to the delay between retries. So, > I didn't include that in the patch. > > > Thanks > Neeraj > >> Thanks >> >>> --- >>> .../admin-guide/kernel-parameters.txt | 18 +++++ >>> kernel/rcu/srcutree.c | 79 >>> ++++++++++++++----- >>> 2 files changed, 78 insertions(+), 19 deletions(-) >>> >>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>> b/Documentation/admin-guide/kernel-parameters.txt >>> index af647714c113..7e34086c64f5 100644 >>> --- a/Documentation/admin-guide/kernel-parameters.txt >>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>> @@ -5805,6 +5805,24 @@ >>> expediting. Set to zero to disable automatic >>> expediting. >>> + srcutree.srcu_max_nodelay [KNL] >>> + Specifies the number of no-delay instances >>> + per jiffy for which the SRCU grace period >>> + worker thread will be rescheduled with zero >>> + delay. Beyond this limit, worker thread will >>> + be rescheduled with a sleep delay of one jiffy. >>> + >>> + srcutree.srcu_max_nodelay_phase [KNL] >>> + Specifies the per-grace-period phase, number of >>> + non-sleeping polls of readers. Beyond this limit, >>> + grace period worker thread will be rescheduled >>> + with a sleep delay of one jiffy, between each >>> + rescan of the readers, for a grace period phase. >>> + >>> + srcutree.srcu_retry_check_delay [KNL] >>> + Specifies number of microseconds of non-sleeping >>> + delay between each non-sleeping poll of readers. >>> + >>> srcutree.small_contention_lim [KNL] >>> Specifies the number of update-side contention >>> events per jiffy will be tolerated before >>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>> index 0db7873f4e95..006828b9c41a 100644 >>> --- a/kernel/rcu/srcutree.c >>> +++ b/kernel/rcu/srcutree.c >>> @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct >>> srcu_struct *ssp) >>> return sum; >>> } >>> -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs >>> pending. >>> -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>> from slow readers. >>> -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase >>> consecutive no-delay instances. >>> -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay >>> instances. >>> +/* >>> + * We use an adaptive strategy for synchronize_srcu() and >>> especially for >>> + * synchronize_srcu_expedited(). We spin for a fixed time period >>> + * (defined below, boot time configurable) to allow SRCU readers to >>> exit >>> + * their read-side critical sections. If there are still some readers >>> + * after one jiffy, we repeatedly block for one jiffy time periods. >>> + * The blocking time is increased as the grace-period age increases, >>> + * with max blocking time capped at 10 jiffies. >>> + */ >>> +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 >>> + >>> +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; >>> +module_param(srcu_retry_check_delay, ulong, 0444); >>> + >>> +#define SRCU_INTERVAL 1 // Base delay if no expedited >>> GPs pending. >>> +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>> from slow readers. >>> + >>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on >>> default per-GP-phase >>> + // no-delay instances. >>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark >>> on default per-GP-phase >>> + // no-delay instances. >>> + >>> +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) >>> +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : >>> (high)) >>> +// per-GP-phase no-delay instances adjusted to allow non-sleeping >>> poll upto >>> +// one jiffies time duration. Mult by 2 is done to factor in the >>> srcu_get_delay() >>> +// called from process_srcu(). >>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ >>> + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) >>> + >>> +// Maximum per-GP-phase consecutive no-delay instances. >>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ >>> + >>> SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, >>> \ >>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ >>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) >>> + >>> +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; >>> +module_param(srcu_max_nodelay_phase, ulong, 0444); >>> + >>> +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive >>> no-delay instances. >>> + >>> +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; >>> +module_param(srcu_max_nodelay, ulong, 0444); >>> /* >>> * Return grace-period delay, zero if there are expedited grace >>> @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct >>> srcu_struct *ssp) >>> jbase += j - gpstart; >>> if (!jbase) { >>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, >>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>> SRCU_MAX_NODELAY_PHASE) >>> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>> srcu_max_nodelay_phase) >>> jbase = 1; >>> } >>> } >>> @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct >>> *ssp, int idx) >>> } >>> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >>> -/* >>> - * We use an adaptive strategy for synchronize_srcu() and >>> especially for >>> - * synchronize_srcu_expedited(). We spin for a fixed time period >>> - * (defined below) to allow SRCU readers to exit their read-side >>> critical >>> - * sections. If there are still some readers after a few >>> microseconds, >>> - * we repeatedly block for 1-millisecond time periods. >>> - */ >>> -#define SRCU_RETRY_CHECK_DELAY 5 >>> - >>> /* >>> * Start an SRCU grace period. >>> */ >>> @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct >>> srcu_struct *ssp, struct srcu_node *snp >>> */ >>> static void srcu_gp_end(struct srcu_struct *ssp) >>> { >>> - unsigned long cbdelay; >>> + unsigned long cbdelay = 1; >>> bool cbs; >>> bool last_lvl; >>> int cpu; >>> @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>> spin_lock_irq_rcu_node(ssp); >>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>> - cbdelay = !!srcu_get_delay(ssp); >>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>> + cbdelay = 0; >>> + >>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>> rcu_seq_end(&ssp->srcu_gp_seq); >>> gpseq = rcu_seq_current(&ssp->srcu_gp_seq); >>> @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct >>> srcu_struct *ssp, struct srcu_data *sdp, >>> */ >>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >>> trycount) >>> { >>> + unsigned long curdelay; >>> + >>> + curdelay = !srcu_get_delay(ssp); >>> + >>> for (;;) { >>> if (srcu_readers_active_idx_check(ssp, idx)) >>> return true; >>> - if (--trycount + !srcu_get_delay(ssp) <= 0) >>> + if ((--trycount + curdelay) <= 0) >>> return false; >>> - udelay(SRCU_RETRY_CHECK_DELAY); >>> + udelay(srcu_retry_check_delay); >>> } >>> } >>> @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct >>> *work) >>> j = jiffies; >>> if (READ_ONCE(ssp->reschedule_jiffies) == j) { >>> WRITE_ONCE(ssp->reschedule_count, >>> READ_ONCE(ssp->reschedule_count) + 1); >>> - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) >>> + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) >>> curdelay = 1; >>> } else { >>> WRITE_ONCE(ssp->reschedule_count, 1); >>> @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) >>> pr_info("Hierarchical SRCU implementation.\n"); >>> if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) >>> pr_info("\tNon-default auto-expedite holdoff of %lu >>> ns.\n", exp_holdoff); >>> + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) >>> + pr_info("\tNon-default retry check delay of %lu us.\n", >>> srcu_retry_check_delay); >>> + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) >>> + pr_info("\tNon-default max no-delay of %lu.\n", >>> srcu_max_nodelay); >>> + pr_info("\tMax phase no-delay instances is %lu.\n", >>> srcu_max_nodelay_phase); >>> return 0; >>> } >>> early_initcall(srcu_bootup_announce); >> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 4:03 ` Zhangfei Gao @ 2022-06-28 4:24 ` Paul E. McKenney 2022-06-28 4:26 ` Neeraj Upadhyay 2022-06-28 9:19 ` Neeraj Upadhyay 2 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2022-06-28 4:24 UTC (permalink / raw) To: Zhangfei Gao Cc: Neeraj Upadhyay, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On Tue, Jun 28, 2022 at 12:03:10PM +0800, Zhangfei Gao wrote: > > > On 2022/6/28 上午11:22, Neeraj Upadhyay wrote: > > > > > > On 6/28/2022 7:44 AM, Zhangfei Gao wrote: > > > > > > > > > On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: > > > > Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > > > > grace periods") highlights a problem where aggressively blocking > > > > SRCU expedited grace periods, as was introduced in commit > > > > 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > > > > from consuming CPU"), introduces ~2 minutes delay to the overall > > > > ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > > > > cmdline on qemu, which results in very high rate of memslots > > > > add/remove, which causes > ~6000 synchronize_srcu() calls for > > > > kvm->srcu SRCU instance. > > > > > > > > Below table captures the experiments done by Zhangfei Gao, Shameer, > > > > to measure the boottime impact with various values of non-sleeping > > > > per phase counts, with HZ_250 and preemption enabled: > > > > > > > > +──────────────────────────+────────────────+ > > > > | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > > > > +──────────────────────────+────────────────+ > > > > | 100 | 30.053 | > > > > | 150 | 25.151 | > > > > | 200 | 20.704 | > > > > | 250 | 15.748 | > > > > | 500 | 11.401 | > > > > | 1000 | 11.443 | > > > > | 10000 | 11.258 | > > > > | 1000000 | 11.154 | > > > > +──────────────────────────+────────────────+ > > > > > > > > Analysis on the experiment results showed improved boot time > > > > with non blocking delays close to one jiffy duration. This > > > > was also seen when number of per-phase iterations were scaled > > > > to one jiffy. > > > > > > > > So, this change scales per-grace-period phase number of non-sleeping > > > > polls, soiuch that, non-sleeping polls are done for one jiffy. > > > > In addition > > > > to this, srcu_get_delay() call in srcu_gp_end(), which is used > > > > to calculate > > > > the delay used for scheduling callbacks, is replaced with the check for > > > > expedited grace period. This is done, to schedule cbs for > > > > completed expedited > > > > grace periods immediately, which results in improved boot time seen in > > > > experiments. > > > > > > > > In addition to the changes to default per phase delays, this change > > > > adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > > > > srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > > > > This allows users to configure the srcu grace period scanning delays, > > > > depending on their system configuration requirements. > > > > > > > > Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > > > > > Test on arm64 > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev > > > 5.19-rc3 > > > > > > arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y > > > CONFIG_HZ_250=y > > > > > > > If it is possible to try out, do you get similar results with HZ_1000? > > CONFIG_HZ_1000=y > CONFIG_HZ=1000 > > with this patch > real 0m10.560s > user 0m3.230s > sys 0m1.024s > > Revert this patch > real 0m44.014s > user 0m3.005s > sys 0m1.287s > > > > > > real 0m11.498s > > > user 0m2.911s > > > sys 0m1.171s > > > > > > > > > As comparison (Since can not directly revert on linux-rcu.git dev, > > > so use rc1 instead) > > > > > > 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers > > > from consuming CPU" > > > > > > real 0m8.173s > > > user 0m3.024s > > > sys 0m0.959s > > > > > > 5.19-rc1 > > > real 2m41.433s > > > user 0m3.097s > > > sys 0m1.177s > > > > > > > Thanks! Can I add your Tested-by in subsequent versions of the patch? > > Curiously, is this treated as fixed? > 11.498s vs. v5.18 8.173s Please keep in mind that v5.18 was hard-hanging for some people because of the original behavior of SRCU. So if you can make this go faster, so much the better, and I would welcome the patches. But not at the expanse of inflicting hard hangs on other people! Just out of curiosity, what is the total boot time for these QEMU "-bios QEMU_EFI.fd" runs? Thanx, Paul > Thanks > > > > > The numbers are aligned to the initial experiments, without using long > > retry delays of 100 us. Using long delays might have impact on other > > workloads, which could be sensitive to the delay between retries. So, > > I didn't include that in the patch. > > > > > > Thanks > > Neeraj > > > > > Thanks > > > > > > > --- > > > > .../admin-guide/kernel-parameters.txt | 18 +++++ > > > > kernel/rcu/srcutree.c | 79 > > > > ++++++++++++++----- > > > > 2 files changed, 78 insertions(+), 19 deletions(-) > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > > > b/Documentation/admin-guide/kernel-parameters.txt > > > > index af647714c113..7e34086c64f5 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -5805,6 +5805,24 @@ > > > > expediting. Set to zero to disable automatic > > > > expediting. > > > > + srcutree.srcu_max_nodelay [KNL] > > > > + Specifies the number of no-delay instances > > > > + per jiffy for which the SRCU grace period > > > > + worker thread will be rescheduled with zero > > > > + delay. Beyond this limit, worker thread will > > > > + be rescheduled with a sleep delay of one jiffy. > > > > + > > > > + srcutree.srcu_max_nodelay_phase [KNL] > > > > + Specifies the per-grace-period phase, number of > > > > + non-sleeping polls of readers. Beyond this limit, > > > > + grace period worker thread will be rescheduled > > > > + with a sleep delay of one jiffy, between each > > > > + rescan of the readers, for a grace period phase. > > > > + > > > > + srcutree.srcu_retry_check_delay [KNL] > > > > + Specifies number of microseconds of non-sleeping > > > > + delay between each non-sleeping poll of readers. > > > > + > > > > srcutree.small_contention_lim [KNL] > > > > Specifies the number of update-side contention > > > > events per jiffy will be tolerated before > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > index 0db7873f4e95..006828b9c41a 100644 > > > > --- a/kernel/rcu/srcutree.c > > > > +++ b/kernel/rcu/srcutree.c > > > > @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct > > > > srcu_struct *ssp) > > > > return sum; > > > > } > > > > -#define SRCU_INTERVAL 1 // Base delay if no expedited > > > > GPs pending. > > > > -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay > > > > from slow readers. > > > > -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase > > > > consecutive no-delay instances. > > > > -#define SRCU_MAX_NODELAY 100 // Maximum consecutive > > > > no-delay instances. > > > > +/* > > > > + * We use an adaptive strategy for synchronize_srcu() and > > > > especially for > > > > + * synchronize_srcu_expedited(). We spin for a fixed time period > > > > + * (defined below, boot time configurable) to allow SRCU > > > > readers to exit > > > > + * their read-side critical sections. If there are still some readers > > > > + * after one jiffy, we repeatedly block for one jiffy time periods. > > > > + * The blocking time is increased as the grace-period age increases, > > > > + * with max blocking time capped at 10 jiffies. > > > > + */ > > > > +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 > > > > + > > > > +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; > > > > +module_param(srcu_retry_check_delay, ulong, 0444); > > > > + > > > > +#define SRCU_INTERVAL 1 // Base delay if no > > > > expedited GPs pending. > > > > +#define SRCU_MAX_INTERVAL 10 // Maximum incremental > > > > delay from slow readers. > > > > + > > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark > > > > on default per-GP-phase > > > > + // no-delay instances. > > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // > > > > Highmark on default per-GP-phase > > > > + // no-delay instances. > > > > + > > > > +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) > > > > +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) > > > > : (high)) > > > > +// per-GP-phase no-delay instances adjusted to allow > > > > non-sleeping poll upto > > > > +// one jiffies time duration. Mult by 2 is done to factor in > > > > the srcu_get_delay() > > > > +// called from process_srcu(). > > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ > > > > + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) > > > > + > > > > +// Maximum per-GP-phase consecutive no-delay instances. > > > > +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ > > > > + SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, > > > > \ > > > > + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ > > > > + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) > > > > + > > > > +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; > > > > +module_param(srcu_max_nodelay_phase, ulong, 0444); > > > > + > > > > +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum > > > > consecutive no-delay instances. > > > > + > > > > +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; > > > > +module_param(srcu_max_nodelay, ulong, 0444); > > > > /* > > > > * Return grace-period delay, zero if there are expedited grace > > > > @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct > > > > srcu_struct *ssp) > > > > jbase += j - gpstart; > > > > if (!jbase) { > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay, > > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); > > > > - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > > SRCU_MAX_NODELAY_PHASE) > > > > + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > > > > > srcu_max_nodelay_phase) > > > > jbase = 1; > > > > } > > > > } > > > > @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct > > > > *ssp, int idx) > > > > } > > > > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > > > -/* > > > > - * We use an adaptive strategy for synchronize_srcu() and > > > > especially for > > > > - * synchronize_srcu_expedited(). We spin for a fixed time period > > > > - * (defined below) to allow SRCU readers to exit their > > > > read-side critical > > > > - * sections. If there are still some readers after a few > > > > microseconds, > > > > - * we repeatedly block for 1-millisecond time periods. > > > > - */ > > > > -#define SRCU_RETRY_CHECK_DELAY 5 > > > > - > > > > /* > > > > * Start an SRCU grace period. > > > > */ > > > > @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct > > > > srcu_struct *ssp, struct srcu_node *snp > > > > */ > > > > static void srcu_gp_end(struct srcu_struct *ssp) > > > > { > > > > - unsigned long cbdelay; > > > > + unsigned long cbdelay = 1; > > > > bool cbs; > > > > bool last_lvl; > > > > int cpu; > > > > @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) > > > > spin_lock_irq_rcu_node(ssp); > > > > idx = rcu_seq_state(ssp->srcu_gp_seq); > > > > WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); > > > > - cbdelay = !!srcu_get_delay(ssp); > > > > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), > > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp))) > > > > + cbdelay = 0; > > > > + > > > > WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); > > > > rcu_seq_end(&ssp->srcu_gp_seq); > > > > gpseq = rcu_seq_current(&ssp->srcu_gp_seq); > > > > @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct > > > > srcu_struct *ssp, struct srcu_data *sdp, > > > > */ > > > > static bool try_check_zero(struct srcu_struct *ssp, int idx, > > > > int trycount) > > > > { > > > > + unsigned long curdelay; > > > > + > > > > + curdelay = !srcu_get_delay(ssp); > > > > + > > > > for (;;) { > > > > if (srcu_readers_active_idx_check(ssp, idx)) > > > > return true; > > > > - if (--trycount + !srcu_get_delay(ssp) <= 0) > > > > + if ((--trycount + curdelay) <= 0) > > > > return false; > > > > - udelay(SRCU_RETRY_CHECK_DELAY); > > > > + udelay(srcu_retry_check_delay); > > > > } > > > > } > > > > @@ -1588,7 +1624,7 @@ static void process_srcu(struct > > > > work_struct *work) > > > > j = jiffies; > > > > if (READ_ONCE(ssp->reschedule_jiffies) == j) { > > > > WRITE_ONCE(ssp->reschedule_count, > > > > READ_ONCE(ssp->reschedule_count) + 1); > > > > - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) > > > > + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) > > > > curdelay = 1; > > > > } else { > > > > WRITE_ONCE(ssp->reschedule_count, 1); > > > > @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) > > > > pr_info("Hierarchical SRCU implementation.\n"); > > > > if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) > > > > pr_info("\tNon-default auto-expedite holdoff of %lu > > > > ns.\n", exp_holdoff); > > > > + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) > > > > + pr_info("\tNon-default retry check delay of %lu us.\n", > > > > srcu_retry_check_delay); > > > > + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) > > > > + pr_info("\tNon-default max no-delay of %lu.\n", > > > > srcu_max_nodelay); > > > > + pr_info("\tMax phase no-delay instances is %lu.\n", > > > > srcu_max_nodelay_phase); > > > > return 0; > > > > } > > > > early_initcall(srcu_bootup_announce); > > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 4:03 ` Zhangfei Gao 2022-06-28 4:24 ` Paul E. McKenney @ 2022-06-28 4:26 ` Neeraj Upadhyay 2022-06-28 9:19 ` Neeraj Upadhyay 2 siblings, 0 replies; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 4:26 UTC (permalink / raw) To: Zhangfei Gao, paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel Cc: linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 6/28/2022 9:33 AM, Zhangfei Gao wrote: > > > On 2022/6/28 上午11:22, Neeraj Upadhyay wrote: >> >> >> On 6/28/2022 7:44 AM, Zhangfei Gao wrote: >>> >>> >>> On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: >>>> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >>>> grace periods") highlights a problem where aggressively blocking >>>> SRCU expedited grace periods, as was introduced in commit >>>> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >>>> from consuming CPU"), introduces ~2 minutes delay to the overall >>>> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >>>> cmdline on qemu, which results in very high rate of memslots >>>> add/remove, which causes > ~6000 synchronize_srcu() calls for >>>> kvm->srcu SRCU instance. >>>> >>>> Below table captures the experiments done by Zhangfei Gao, Shameer, >>>> to measure the boottime impact with various values of non-sleeping >>>> per phase counts, with HZ_250 and preemption enabled: >>>> >>>> +──────────────────────────+────────────────+ >>>> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >>>> +──────────────────────────+────────────────+ >>>> | 100 | 30.053 | >>>> | 150 | 25.151 | >>>> | 200 | 20.704 | >>>> | 250 | 15.748 | >>>> | 500 | 11.401 | >>>> | 1000 | 11.443 | >>>> | 10000 | 11.258 | >>>> | 1000000 | 11.154 | >>>> +──────────────────────────+────────────────+ >>>> >>>> Analysis on the experiment results showed improved boot time >>>> with non blocking delays close to one jiffy duration. This >>>> was also seen when number of per-phase iterations were scaled >>>> to one jiffy. >>>> >>>> So, this change scales per-grace-period phase number of non-sleeping >>>> polls, soiuch that, non-sleeping polls are done for one jiffy. In >>>> addition >>>> to this, srcu_get_delay() call in srcu_gp_end(), which is used to >>>> calculate >>>> the delay used for scheduling callbacks, is replaced with the check for >>>> expedited grace period. This is done, to schedule cbs for completed >>>> expedited >>>> grace periods immediately, which results in improved boot time seen in >>>> experiments. >>>> >>>> In addition to the changes to default per phase delays, this change >>>> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >>>> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >>>> This allows users to configure the srcu grace period scanning delays, >>>> depending on their system configuration requirements. >>>> >>>> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> >>> >>> Test on arm64 >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev >>> 5.19-rc3 >>> >>> arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y >>> CONFIG_HZ_250=y >>> >> >> If it is possible to try out, do you get similar results with HZ_1000? > > CONFIG_HZ_1000=y > CONFIG_HZ=1000 > > with this patch > real 0m10.560s > user 0m3.230s > sys 0m1.024s > > Revert this patch > real 0m44.014s > user 0m3.005s > sys 0m1.287s > Thank you very much for this data! >> >>> real 0m11.498s >>> user 0m2.911s >>> sys 0m1.171s >>> >>> >>> As comparison (Since can not directly revert on linux-rcu.git dev, so >>> use rc1 instead) >>> >>> 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers >>> from consuming CPU" >>> >>> real 0m8.173s >>> user 0m3.024s >>> sys 0m0.959s >>> >>> 5.19-rc1 >>> real 2m41.433s >>> user 0m3.097s >>> sys 0m1.177s >>> >> >> Thanks! Can I add your Tested-by in subsequent versions of the patch? > > Curiously, is this treated as fixed? > 11.498s vs. v5.18 8.173s > I think it fixes the use case at least - deploy qemu VMs with flash emulation; however, I do not have information on, if ~11.5 seconds is acceptable from performance point of view, for this use case? Thanks Neeraj > Thanks > >> >> The numbers are aligned to the initial experiments, without using long >> retry delays of 100 us. Using long delays might have impact on other >> workloads, which could be sensitive to the delay between retries. So, >> I didn't include that in the patch. >> >> >> Thanks >> Neeraj >> >>> Thanks >>> >>>> --- >>>> .../admin-guide/kernel-parameters.txt | 18 +++++ >>>> kernel/rcu/srcutree.c | 79 >>>> ++++++++++++++----- >>>> 2 files changed, 78 insertions(+), 19 deletions(-) >>>> >>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>>> b/Documentation/admin-guide/kernel-parameters.txt >>>> index af647714c113..7e34086c64f5 100644 >>>> --- a/Documentation/admin-guide/kernel-parameters.txt >>>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>>> @@ -5805,6 +5805,24 @@ >>>> expediting. Set to zero to disable automatic >>>> expediting. >>>> + srcutree.srcu_max_nodelay [KNL] >>>> + Specifies the number of no-delay instances >>>> + per jiffy for which the SRCU grace period >>>> + worker thread will be rescheduled with zero >>>> + delay. Beyond this limit, worker thread will >>>> + be rescheduled with a sleep delay of one jiffy. >>>> + >>>> + srcutree.srcu_max_nodelay_phase [KNL] >>>> + Specifies the per-grace-period phase, number of >>>> + non-sleeping polls of readers. Beyond this limit, >>>> + grace period worker thread will be rescheduled >>>> + with a sleep delay of one jiffy, between each >>>> + rescan of the readers, for a grace period phase. >>>> + >>>> + srcutree.srcu_retry_check_delay [KNL] >>>> + Specifies number of microseconds of non-sleeping >>>> + delay between each non-sleeping poll of readers. >>>> + >>>> srcutree.small_contention_lim [KNL] >>>> Specifies the number of update-side contention >>>> events per jiffy will be tolerated before >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>>> index 0db7873f4e95..006828b9c41a 100644 >>>> --- a/kernel/rcu/srcutree.c >>>> +++ b/kernel/rcu/srcutree.c >>>> @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct >>>> srcu_struct *ssp) >>>> return sum; >>>> } >>>> -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs >>>> pending. >>>> -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>>> from slow readers. >>>> -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase >>>> consecutive no-delay instances. >>>> -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay >>>> instances. >>>> +/* >>>> + * We use an adaptive strategy for synchronize_srcu() and >>>> especially for >>>> + * synchronize_srcu_expedited(). We spin for a fixed time period >>>> + * (defined below, boot time configurable) to allow SRCU readers to >>>> exit >>>> + * their read-side critical sections. If there are still some readers >>>> + * after one jiffy, we repeatedly block for one jiffy time periods. >>>> + * The blocking time is increased as the grace-period age increases, >>>> + * with max blocking time capped at 10 jiffies. >>>> + */ >>>> +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 >>>> + >>>> +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; >>>> +module_param(srcu_retry_check_delay, ulong, 0444); >>>> + >>>> +#define SRCU_INTERVAL 1 // Base delay if no expedited >>>> GPs pending. >>>> +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>>> from slow readers. >>>> + >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on >>>> default per-GP-phase >>>> + // no-delay instances. >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark >>>> on default per-GP-phase >>>> + // no-delay instances. >>>> + >>>> +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) >>>> +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : >>>> (high)) >>>> +// per-GP-phase no-delay instances adjusted to allow non-sleeping >>>> poll upto >>>> +// one jiffies time duration. Mult by 2 is done to factor in the >>>> srcu_get_delay() >>>> +// called from process_srcu(). >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ >>>> + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) >>>> + >>>> +// Maximum per-GP-phase consecutive no-delay instances. >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ >>>> + >>>> SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, >>>> \ >>>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ >>>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) >>>> + >>>> +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; >>>> +module_param(srcu_max_nodelay_phase, ulong, 0444); >>>> + >>>> +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive >>>> no-delay instances. >>>> + >>>> +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; >>>> +module_param(srcu_max_nodelay, ulong, 0444); >>>> /* >>>> * Return grace-period delay, zero if there are expedited grace >>>> @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct >>>> srcu_struct *ssp) >>>> jbase += j - gpstart; >>>> if (!jbase) { >>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, >>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>> SRCU_MAX_NODELAY_PHASE) >>>> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>> srcu_max_nodelay_phase) >>>> jbase = 1; >>>> } >>>> } >>>> @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct >>>> *ssp, int idx) >>>> } >>>> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >>>> -/* >>>> - * We use an adaptive strategy for synchronize_srcu() and >>>> especially for >>>> - * synchronize_srcu_expedited(). We spin for a fixed time period >>>> - * (defined below) to allow SRCU readers to exit their read-side >>>> critical >>>> - * sections. If there are still some readers after a few >>>> microseconds, >>>> - * we repeatedly block for 1-millisecond time periods. >>>> - */ >>>> -#define SRCU_RETRY_CHECK_DELAY 5 >>>> - >>>> /* >>>> * Start an SRCU grace period. >>>> */ >>>> @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct >>>> srcu_struct *ssp, struct srcu_node *snp >>>> */ >>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>> { >>>> - unsigned long cbdelay; >>>> + unsigned long cbdelay = 1; >>>> bool cbs; >>>> bool last_lvl; >>>> int cpu; >>>> @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>>> spin_lock_irq_rcu_node(ssp); >>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>> - cbdelay = !!srcu_get_delay(ssp); >>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>> + cbdelay = 0; >>>> + >>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>>> rcu_seq_end(&ssp->srcu_gp_seq); >>>> gpseq = rcu_seq_current(&ssp->srcu_gp_seq); >>>> @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct >>>> srcu_struct *ssp, struct srcu_data *sdp, >>>> */ >>>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >>>> trycount) >>>> { >>>> + unsigned long curdelay; >>>> + >>>> + curdelay = !srcu_get_delay(ssp); >>>> + >>>> for (;;) { >>>> if (srcu_readers_active_idx_check(ssp, idx)) >>>> return true; >>>> - if (--trycount + !srcu_get_delay(ssp) <= 0) >>>> + if ((--trycount + curdelay) <= 0) >>>> return false; >>>> - udelay(SRCU_RETRY_CHECK_DELAY); >>>> + udelay(srcu_retry_check_delay); >>>> } >>>> } >>>> @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct >>>> *work) >>>> j = jiffies; >>>> if (READ_ONCE(ssp->reschedule_jiffies) == j) { >>>> WRITE_ONCE(ssp->reschedule_count, >>>> READ_ONCE(ssp->reschedule_count) + 1); >>>> - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) >>>> + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) >>>> curdelay = 1; >>>> } else { >>>> WRITE_ONCE(ssp->reschedule_count, 1); >>>> @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) >>>> pr_info("Hierarchical SRCU implementation.\n"); >>>> if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) >>>> pr_info("\tNon-default auto-expedite holdoff of %lu >>>> ns.\n", exp_holdoff); >>>> + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) >>>> + pr_info("\tNon-default retry check delay of %lu us.\n", >>>> srcu_retry_check_delay); >>>> + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) >>>> + pr_info("\tNon-default max no-delay of %lu.\n", >>>> srcu_max_nodelay); >>>> + pr_info("\tMax phase no-delay instances is %lu.\n", >>>> srcu_max_nodelay_phase); >>>> return 0; >>>> } >>>> early_initcall(srcu_bootup_announce); >>> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 4:03 ` Zhangfei Gao 2022-06-28 4:24 ` Paul E. McKenney 2022-06-28 4:26 ` Neeraj Upadhyay @ 2022-06-28 9:19 ` Neeraj Upadhyay 2 siblings, 0 replies; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 9:19 UTC (permalink / raw) To: Zhangfei Gao, paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel Cc: linux-kernel, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66, maz On 6/28/2022 9:33 AM, Zhangfei Gao wrote: > > > On 2022/6/28 上午11:22, Neeraj Upadhyay wrote: >> >> >> On 6/28/2022 7:44 AM, Zhangfei Gao wrote: >>> >>> >>> On 2022/6/27 下午8:37, Neeraj Upadhyay wrote: >>>> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >>>> grace periods") highlights a problem where aggressively blocking >>>> SRCU expedited grace periods, as was introduced in commit >>>> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >>>> from consuming CPU"), introduces ~2 minutes delay to the overall >>>> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >>>> cmdline on qemu, which results in very high rate of memslots >>>> add/remove, which causes > ~6000 synchronize_srcu() calls for >>>> kvm->srcu SRCU instance. >>>> >>>> Below table captures the experiments done by Zhangfei Gao, Shameer, >>>> to measure the boottime impact with various values of non-sleeping >>>> per phase counts, with HZ_250 and preemption enabled: >>>> >>>> +──────────────────────────+────────────────+ >>>> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >>>> +──────────────────────────+────────────────+ >>>> | 100 | 30.053 | >>>> | 150 | 25.151 | >>>> | 200 | 20.704 | >>>> | 250 | 15.748 | >>>> | 500 | 11.401 | >>>> | 1000 | 11.443 | >>>> | 10000 | 11.258 | >>>> | 1000000 | 11.154 | >>>> +──────────────────────────+────────────────+ >>>> >>>> Analysis on the experiment results showed improved boot time >>>> with non blocking delays close to one jiffy duration. This >>>> was also seen when number of per-phase iterations were scaled >>>> to one jiffy. >>>> >>>> So, this change scales per-grace-period phase number of non-sleeping >>>> polls, soiuch that, non-sleeping polls are done for one jiffy. In >>>> addition >>>> to this, srcu_get_delay() call in srcu_gp_end(), which is used to >>>> calculate >>>> the delay used for scheduling callbacks, is replaced with the check for >>>> expedited grace period. This is done, to schedule cbs for completed >>>> expedited >>>> grace periods immediately, which results in improved boot time seen in >>>> experiments. >>>> >>>> In addition to the changes to default per phase delays, this change >>>> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >>>> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >>>> This allows users to configure the srcu grace period scanning delays, >>>> depending on their system configuration requirements. >>>> >>>> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> >>> >>> Test on arm64 >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev >>> 5.19-rc3 >>> >>> arch/arm64/configs/defconfig make defconfig CONFIG_PREEMPTION=y >>> CONFIG_HZ_250=y >>> >> >> If it is possible to try out, do you get similar results with HZ_1000? > > CONFIG_HZ_1000=y > CONFIG_HZ=1000 > > with this patch > real 0m10.560s > user 0m3.230s > sys 0m1.024s > > Revert this patch > real 0m44.014s > user 0m3.005s > sys 0m1.287s > >> >>> real 0m11.498s >>> user 0m2.911s >>> sys 0m1.171s >>> >>> >>> As comparison (Since can not directly revert on linux-rcu.git dev, so >>> use rc1 instead) >>> >>> 5.19-rc1 + Revert "srcu: Prevent expedited GPs and blocking readers >>> from consuming CPU" >>> >>> real 0m8.173s >>> user 0m3.024s >>> sys 0m0.959s >>> >>> 5.19-rc1 >>> real 2m41.433s >>> user 0m3.097s >>> sys 0m1.177s >>> >> >> Thanks! Can I add your Tested-by in subsequent versions of the patch? > > Curiously, is this treated as fixed? > 11.498s vs. v5.18 8.173s > If you use srcutree.srcu_max_nodelay=1000 bootarg, on top of this patch, do you see any improvement in boot time? Thanks Neeraj > Thanks > >> >> The numbers are aligned to the initial experiments, without using long >> retry delays of 100 us. Using long delays might have impact on other >> workloads, which could be sensitive to the delay between retries. So, >> I didn't include that in the patch. >> >> >> Thanks >> Neeraj >> >>> Thanks >>> >>>> --- >>>> .../admin-guide/kernel-parameters.txt | 18 +++++ >>>> kernel/rcu/srcutree.c | 79 >>>> ++++++++++++++----- >>>> 2 files changed, 78 insertions(+), 19 deletions(-) >>>> >>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>>> b/Documentation/admin-guide/kernel-parameters.txt >>>> index af647714c113..7e34086c64f5 100644 >>>> --- a/Documentation/admin-guide/kernel-parameters.txt >>>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>>> @@ -5805,6 +5805,24 @@ >>>> expediting. Set to zero to disable automatic >>>> expediting. >>>> + srcutree.srcu_max_nodelay [KNL] >>>> + Specifies the number of no-delay instances >>>> + per jiffy for which the SRCU grace period >>>> + worker thread will be rescheduled with zero >>>> + delay. Beyond this limit, worker thread will >>>> + be rescheduled with a sleep delay of one jiffy. >>>> + >>>> + srcutree.srcu_max_nodelay_phase [KNL] >>>> + Specifies the per-grace-period phase, number of >>>> + non-sleeping polls of readers. Beyond this limit, >>>> + grace period worker thread will be rescheduled >>>> + with a sleep delay of one jiffy, between each >>>> + rescan of the readers, for a grace period phase. >>>> + >>>> + srcutree.srcu_retry_check_delay [KNL] >>>> + Specifies number of microseconds of non-sleeping >>>> + delay between each non-sleeping poll of readers. >>>> + >>>> srcutree.small_contention_lim [KNL] >>>> Specifies the number of update-side contention >>>> events per jiffy will be tolerated before >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c >>>> index 0db7873f4e95..006828b9c41a 100644 >>>> --- a/kernel/rcu/srcutree.c >>>> +++ b/kernel/rcu/srcutree.c >>>> @@ -511,10 +511,49 @@ static bool srcu_readers_active(struct >>>> srcu_struct *ssp) >>>> return sum; >>>> } >>>> -#define SRCU_INTERVAL 1 // Base delay if no expedited GPs >>>> pending. >>>> -#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>>> from slow readers. >>>> -#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase >>>> consecutive no-delay instances. >>>> -#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay >>>> instances. >>>> +/* >>>> + * We use an adaptive strategy for synchronize_srcu() and >>>> especially for >>>> + * synchronize_srcu_expedited(). We spin for a fixed time period >>>> + * (defined below, boot time configurable) to allow SRCU readers to >>>> exit >>>> + * their read-side critical sections. If there are still some readers >>>> + * after one jiffy, we repeatedly block for one jiffy time periods. >>>> + * The blocking time is increased as the grace-period age increases, >>>> + * with max blocking time capped at 10 jiffies. >>>> + */ >>>> +#define SRCU_DEFAULT_RETRY_CHECK_DELAY 5 >>>> + >>>> +static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY; >>>> +module_param(srcu_retry_check_delay, ulong, 0444); >>>> + >>>> +#define SRCU_INTERVAL 1 // Base delay if no expedited >>>> GPs pending. >>>> +#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay >>>> from slow readers. >>>> + >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO 3UL // Lowmark on >>>> default per-GP-phase >>>> + // no-delay instances. >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI 1000UL // Highmark >>>> on default per-GP-phase >>>> + // no-delay instances. >>>> + >>>> +#define SRCU_UL_CLAMP_LO(val, low) ((val) > (low) ? (val) : (low)) >>>> +#define SRCU_UL_CLAMP_HI(val, high) ((val) < (high) ? (val) : >>>> (high)) >>>> +// per-GP-phase no-delay instances adjusted to allow non-sleeping >>>> poll upto >>>> +// one jiffies time duration. Mult by 2 is done to factor in the >>>> srcu_get_delay() >>>> +// called from process_srcu(). >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED \ >>>> + (2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY) >>>> + >>>> +// Maximum per-GP-phase consecutive no-delay instances. >>>> +#define SRCU_DEFAULT_MAX_NODELAY_PHASE ( \ >>>> + >>>> SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED, >>>> \ >>>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_LO), \ >>>> + SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)) >>>> + >>>> +static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE; >>>> +module_param(srcu_max_nodelay_phase, ulong, 0444); >>>> + >>>> +#define SRCU_DEFAULT_MAX_NODELAY 100 // Maximum consecutive >>>> no-delay instances. >>>> + >>>> +static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY; >>>> +module_param(srcu_max_nodelay, ulong, 0444); >>>> /* >>>> * Return grace-period delay, zero if there are expedited grace >>>> @@ -535,7 +574,7 @@ static unsigned long srcu_get_delay(struct >>>> srcu_struct *ssp) >>>> jbase += j - gpstart; >>>> if (!jbase) { >>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, >>>> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); >>>> - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>> SRCU_MAX_NODELAY_PHASE) >>>> + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > >>>> srcu_max_nodelay_phase) >>>> jbase = 1; >>>> } >>>> } >>>> @@ -612,15 +651,6 @@ void __srcu_read_unlock(struct srcu_struct >>>> *ssp, int idx) >>>> } >>>> EXPORT_SYMBOL_GPL(__srcu_read_unlock); >>>> -/* >>>> - * We use an adaptive strategy for synchronize_srcu() and >>>> especially for >>>> - * synchronize_srcu_expedited(). We spin for a fixed time period >>>> - * (defined below) to allow SRCU readers to exit their read-side >>>> critical >>>> - * sections. If there are still some readers after a few >>>> microseconds, >>>> - * we repeatedly block for 1-millisecond time periods. >>>> - */ >>>> -#define SRCU_RETRY_CHECK_DELAY 5 >>>> - >>>> /* >>>> * Start an SRCU grace period. >>>> */ >>>> @@ -706,7 +736,7 @@ static void srcu_schedule_cbs_snp(struct >>>> srcu_struct *ssp, struct srcu_node *snp >>>> */ >>>> static void srcu_gp_end(struct srcu_struct *ssp) >>>> { >>>> - unsigned long cbdelay; >>>> + unsigned long cbdelay = 1; >>>> bool cbs; >>>> bool last_lvl; >>>> int cpu; >>>> @@ -726,7 +756,9 @@ static void srcu_gp_end(struct srcu_struct *ssp) >>>> spin_lock_irq_rcu_node(ssp); >>>> idx = rcu_seq_state(ssp->srcu_gp_seq); >>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); >>>> - cbdelay = !!srcu_get_delay(ssp); >>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), >>>> READ_ONCE(ssp->srcu_gp_seq_needed_exp))) >>>> + cbdelay = 0; >>>> + >>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns()); >>>> rcu_seq_end(&ssp->srcu_gp_seq); >>>> gpseq = rcu_seq_current(&ssp->srcu_gp_seq); >>>> @@ -927,12 +959,16 @@ static void srcu_funnel_gp_start(struct >>>> srcu_struct *ssp, struct srcu_data *sdp, >>>> */ >>>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int >>>> trycount) >>>> { >>>> + unsigned long curdelay; >>>> + >>>> + curdelay = !srcu_get_delay(ssp); >>>> + >>>> for (;;) { >>>> if (srcu_readers_active_idx_check(ssp, idx)) >>>> return true; >>>> - if (--trycount + !srcu_get_delay(ssp) <= 0) >>>> + if ((--trycount + curdelay) <= 0) >>>> return false; >>>> - udelay(SRCU_RETRY_CHECK_DELAY); >>>> + udelay(srcu_retry_check_delay); >>>> } >>>> } >>>> @@ -1588,7 +1624,7 @@ static void process_srcu(struct work_struct >>>> *work) >>>> j = jiffies; >>>> if (READ_ONCE(ssp->reschedule_jiffies) == j) { >>>> WRITE_ONCE(ssp->reschedule_count, >>>> READ_ONCE(ssp->reschedule_count) + 1); >>>> - if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY) >>>> + if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay) >>>> curdelay = 1; >>>> } else { >>>> WRITE_ONCE(ssp->reschedule_count, 1); >>>> @@ -1680,6 +1716,11 @@ static int __init srcu_bootup_announce(void) >>>> pr_info("Hierarchical SRCU implementation.\n"); >>>> if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF) >>>> pr_info("\tNon-default auto-expedite holdoff of %lu >>>> ns.\n", exp_holdoff); >>>> + if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY) >>>> + pr_info("\tNon-default retry check delay of %lu us.\n", >>>> srcu_retry_check_delay); >>>> + if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY) >>>> + pr_info("\tNon-default max no-delay of %lu.\n", >>>> srcu_max_nodelay); >>>> + pr_info("\tMax phase no-delay instances is %lu.\n", >>>> srcu_max_nodelay_phase); >>>> return 0; >>>> } >>>> early_initcall(srcu_bootup_announce); >>> > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-27 12:37 [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further Neeraj Upadhyay 2022-06-27 22:45 ` Paul E. McKenney 2022-06-28 2:14 ` Zhangfei Gao @ 2022-06-28 9:02 ` Marc Zyngier 2022-06-28 9:17 ` Neeraj Upadhyay 2 siblings, 1 reply; 15+ messages in thread From: Marc Zyngier @ 2022-06-28 9:02 UTC (permalink / raw) To: Neeraj Upadhyay Cc: paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66 On Mon, 27 Jun 2022 13:37:06 +0100, Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: > > Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > grace periods") highlights a problem where aggressively blocking > SRCU expedited grace periods, as was introduced in commit > 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > from consuming CPU"), introduces ~2 minutes delay to the overall > ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > cmdline on qemu, which results in very high rate of memslots > add/remove, which causes > ~6000 synchronize_srcu() calls for > kvm->srcu SRCU instance. > > Below table captures the experiments done by Zhangfei Gao, Shameer, > to measure the boottime impact with various values of non-sleeping > per phase counts, with HZ_250 and preemption enabled: > > +──────────────────────────+────────────────+ > | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > +──────────────────────────+────────────────+ > | 100 | 30.053 | > | 150 | 25.151 | > | 200 | 20.704 | > | 250 | 15.748 | > | 500 | 11.401 | > | 1000 | 11.443 | > | 10000 | 11.258 | > | 1000000 | 11.154 | > +──────────────────────────+────────────────+ > > Analysis on the experiment results showed improved boot time > with non blocking delays close to one jiffy duration. This > was also seen when number of per-phase iterations were scaled > to one jiffy. > > So, this change scales per-grace-period phase number of non-sleeping > polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > the delay used for scheduling callbacks, is replaced with the check for > expedited grace period. This is done, to schedule cbs for completed expedited > grace periods immediately, which results in improved boot time seen in > experiments. > > In addition to the changes to default per phase delays, this change > adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > This allows users to configure the srcu grace period scanning delays, > depending on their system configuration requirements. > > Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> I've given this a go on one of my test platforms (the one I noticed the issue on the first place), and found that the initial part of the EFI boot under KVM (pointlessly wiping the emulated flash) went down to 1m7s from 3m50s (HZ=250). Clearly a massive improvement, but still a far cry from the original ~40s (yes, this box is utter crap -- which is why I use it). Anyway: Tested-by: Marc Zyngier <maz@kernel.org> M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 9:02 ` Marc Zyngier @ 2022-06-28 9:17 ` Neeraj Upadhyay 2022-06-28 9:31 ` Marc Zyngier 0 siblings, 1 reply; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 9:17 UTC (permalink / raw) To: Marc Zyngier Cc: paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66 On 6/28/2022 2:32 PM, Marc Zyngier wrote: > On Mon, 27 Jun 2022 13:37:06 +0100, > Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: >> >> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >> grace periods") highlights a problem where aggressively blocking >> SRCU expedited grace periods, as was introduced in commit >> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >> from consuming CPU"), introduces ~2 minutes delay to the overall >> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >> cmdline on qemu, which results in very high rate of memslots >> add/remove, which causes > ~6000 synchronize_srcu() calls for >> kvm->srcu SRCU instance. >> >> Below table captures the experiments done by Zhangfei Gao, Shameer, >> to measure the boottime impact with various values of non-sleeping >> per phase counts, with HZ_250 and preemption enabled: >> >> +──────────────────────────+────────────────+ >> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >> +──────────────────────────+────────────────+ >> | 100 | 30.053 | >> | 150 | 25.151 | >> | 200 | 20.704 | >> | 250 | 15.748 | >> | 500 | 11.401 | >> | 1000 | 11.443 | >> | 10000 | 11.258 | >> | 1000000 | 11.154 | >> +──────────────────────────+────────────────+ >> >> Analysis on the experiment results showed improved boot time >> with non blocking delays close to one jiffy duration. This >> was also seen when number of per-phase iterations were scaled >> to one jiffy. >> >> So, this change scales per-grace-period phase number of non-sleeping >> polls, soiuch that, non-sleeping polls are done for one jiffy. In addition >> to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate >> the delay used for scheduling callbacks, is replaced with the check for >> expedited grace period. This is done, to schedule cbs for completed expedited >> grace periods immediately, which results in improved boot time seen in >> experiments. >> >> In addition to the changes to default per phase delays, this change >> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >> This allows users to configure the srcu grace period scanning delays, >> depending on their system configuration requirements. >> >> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > I've given this a go on one of my test platforms (the one I noticed > the issue on the first place), and found that the initial part of the > EFI boot under KVM (pointlessly wiping the emulated flash) went down > to 1m7s from 3m50s (HZ=250). > > Clearly a massive improvement, but still a far cry from the original > ~40s (yes, this box is utter crap -- which is why I use it). Do you see any improvement by using "srcutree.srcu_max_nodelay=1000" bootarg, on top of this patch? > > Anyway: > > Tested-by: Marc Zyngier <maz@kernel.org> Thanks! Thanks Neeraj > > M. > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 9:17 ` Neeraj Upadhyay @ 2022-06-28 9:31 ` Marc Zyngier 2022-06-28 10:02 ` Neeraj Upadhyay 2022-06-28 13:46 ` Paul E. McKenney 0 siblings, 2 replies; 15+ messages in thread From: Marc Zyngier @ 2022-06-28 9:31 UTC (permalink / raw) To: Neeraj Upadhyay Cc: paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66 On Tue, 28 Jun 2022 10:17:24 +0100, Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: > > > > On 6/28/2022 2:32 PM, Marc Zyngier wrote: > > On Mon, 27 Jun 2022 13:37:06 +0100, > > Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: > >> > >> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > >> grace periods") highlights a problem where aggressively blocking > >> SRCU expedited grace periods, as was introduced in commit > >> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > >> from consuming CPU"), introduces ~2 minutes delay to the overall > >> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > >> cmdline on qemu, which results in very high rate of memslots > >> add/remove, which causes > ~6000 synchronize_srcu() calls for > >> kvm->srcu SRCU instance. > >> > >> Below table captures the experiments done by Zhangfei Gao, Shameer, > >> to measure the boottime impact with various values of non-sleeping > >> per phase counts, with HZ_250 and preemption enabled: > >> > >> +──────────────────────────+────────────────+ > >> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > >> +──────────────────────────+────────────────+ > >> | 100 | 30.053 | > >> | 150 | 25.151 | > >> | 200 | 20.704 | > >> | 250 | 15.748 | > >> | 500 | 11.401 | > >> | 1000 | 11.443 | > >> | 10000 | 11.258 | > >> | 1000000 | 11.154 | > >> +──────────────────────────+────────────────+ > >> > >> Analysis on the experiment results showed improved boot time > >> with non blocking delays close to one jiffy duration. This > >> was also seen when number of per-phase iterations were scaled > >> to one jiffy. > >> > >> So, this change scales per-grace-period phase number of non-sleeping > >> polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > >> to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > >> the delay used for scheduling callbacks, is replaced with the check for > >> expedited grace period. This is done, to schedule cbs for completed expedited > >> grace periods immediately, which results in improved boot time seen in > >> experiments. > >> > >> In addition to the changes to default per phase delays, this change > >> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > >> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > >> This allows users to configure the srcu grace period scanning delays, > >> depending on their system configuration requirements. > >> > >> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > > > I've given this a go on one of my test platforms (the one I noticed > > the issue on the first place), and found that the initial part of the > > EFI boot under KVM (pointlessly wiping the emulated flash) went down > > to 1m7s from 3m50s (HZ=250). > > > > Clearly a massive improvement, but still a far cry from the original > > ~40s (yes, this box is utter crap -- which is why I use it). > > Do you see any improvement by using "srcutree.srcu_max_nodelay=1000" > bootarg, on top of this patch? Yup, this brings it back to 43s on a quick test run, which is close enough to what I had before. How does a random user come up with such a value though? Thanks, M. -- Without deviation from the norm, progress is not possible. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 9:31 ` Marc Zyngier @ 2022-06-28 10:02 ` Neeraj Upadhyay 2022-06-28 13:46 ` Paul E. McKenney 1 sibling, 0 replies; 15+ messages in thread From: Neeraj Upadhyay @ 2022-06-28 10:02 UTC (permalink / raw) To: Marc Zyngier Cc: paulmck, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66 On 6/28/2022 3:01 PM, Marc Zyngier wrote: > On Tue, 28 Jun 2022 10:17:24 +0100, > Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: >> >> >> >> On 6/28/2022 2:32 PM, Marc Zyngier wrote: >>> On Mon, 27 Jun 2022 13:37:06 +0100, >>> Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: >>>> >>>> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited >>>> grace periods") highlights a problem where aggressively blocking >>>> SRCU expedited grace periods, as was introduced in commit >>>> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers >>>> from consuming CPU"), introduces ~2 minutes delay to the overall >>>> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" >>>> cmdline on qemu, which results in very high rate of memslots >>>> add/remove, which causes > ~6000 synchronize_srcu() calls for >>>> kvm->srcu SRCU instance. >>>> >>>> Below table captures the experiments done by Zhangfei Gao, Shameer, >>>> to measure the boottime impact with various values of non-sleeping >>>> per phase counts, with HZ_250 and preemption enabled: >>>> >>>> +──────────────────────────+────────────────+ >>>> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | >>>> +──────────────────────────+────────────────+ >>>> | 100 | 30.053 | >>>> | 150 | 25.151 | >>>> | 200 | 20.704 | >>>> | 250 | 15.748 | >>>> | 500 | 11.401 | >>>> | 1000 | 11.443 | >>>> | 10000 | 11.258 | >>>> | 1000000 | 11.154 | >>>> +──────────────────────────+────────────────+ >>>> >>>> Analysis on the experiment results showed improved boot time >>>> with non blocking delays close to one jiffy duration. This >>>> was also seen when number of per-phase iterations were scaled >>>> to one jiffy. >>>> >>>> So, this change scales per-grace-period phase number of non-sleeping >>>> polls, soiuch that, non-sleeping polls are done for one jiffy. In addition >>>> to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate >>>> the delay used for scheduling callbacks, is replaced with the check for >>>> expedited grace period. This is done, to schedule cbs for completed expedited >>>> grace periods immediately, which results in improved boot time seen in >>>> experiments. >>>> >>>> In addition to the changes to default per phase delays, this change >>>> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, >>>> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. >>>> This allows users to configure the srcu grace period scanning delays, >>>> depending on their system configuration requirements. >>>> >>>> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> >>> >>> I've given this a go on one of my test platforms (the one I noticed >>> the issue on the first place), and found that the initial part of the >>> EFI boot under KVM (pointlessly wiping the emulated flash) went down >>> to 1m7s from 3m50s (HZ=250). >>> >>> Clearly a massive improvement, but still a far cry from the original >>> ~40s (yes, this box is utter crap -- which is why I use it). >> >> Do you see any improvement by using "srcutree.srcu_max_nodelay=1000" >> bootarg, on top of this patch? > > Yup, this brings it back to 43s on a quick test run, which is close > enough to what I had before. > Cool, thanks! > How does a random user come up with such a value though? > It need to be tuned :) The patch actually adds 2 jiffies (vs the one jiffy mentioned in the commit log) of non-sleep per phase delay. Each phase iteration uses a delay of 10 us. So, for CONFIG_HZ_250 its around 800 iterations. Thanks Neeraj > Thanks, > > M. > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further 2022-06-28 9:31 ` Marc Zyngier 2022-06-28 10:02 ` Neeraj Upadhyay @ 2022-06-28 13:46 ` Paul E. McKenney 1 sibling, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2022-06-28 13:46 UTC (permalink / raw) To: Marc Zyngier Cc: Neeraj Upadhyay, frederic, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, linux-kernel, zhangfei.gao, boqun.feng, urezki, shameerali.kolothum.thodi, pbonzini, mtosatti, eric.auger, chenxiang66 On Tue, Jun 28, 2022 at 10:31:54AM +0100, Marc Zyngier wrote: > On Tue, 28 Jun 2022 10:17:24 +0100, > Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: > > > > > > > > On 6/28/2022 2:32 PM, Marc Zyngier wrote: > > > On Mon, 27 Jun 2022 13:37:06 +0100, > > > Neeraj Upadhyay <quic_neeraju@quicinc.com> wrote: > > >> > > >> Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited > > >> grace periods") highlights a problem where aggressively blocking > > >> SRCU expedited grace periods, as was introduced in commit > > >> 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers > > >> from consuming CPU"), introduces ~2 minutes delay to the overall > > >> ~3.5 minutes boot time, when starting VMs with "-bios QEMU_EFI.fd" > > >> cmdline on qemu, which results in very high rate of memslots > > >> add/remove, which causes > ~6000 synchronize_srcu() calls for > > >> kvm->srcu SRCU instance. > > >> > > >> Below table captures the experiments done by Zhangfei Gao, Shameer, > > >> to measure the boottime impact with various values of non-sleeping > > >> per phase counts, with HZ_250 and preemption enabled: > > >> > > >> +──────────────────────────+────────────────+ > > >> | SRCU_MAX_NODELAY_PHASE | Boot time (s) | > > >> +──────────────────────────+────────────────+ > > >> | 100 | 30.053 | > > >> | 150 | 25.151 | > > >> | 200 | 20.704 | > > >> | 250 | 15.748 | > > >> | 500 | 11.401 | > > >> | 1000 | 11.443 | > > >> | 10000 | 11.258 | > > >> | 1000000 | 11.154 | > > >> +──────────────────────────+────────────────+ > > >> > > >> Analysis on the experiment results showed improved boot time > > >> with non blocking delays close to one jiffy duration. This > > >> was also seen when number of per-phase iterations were scaled > > >> to one jiffy. > > >> > > >> So, this change scales per-grace-period phase number of non-sleeping > > >> polls, soiuch that, non-sleeping polls are done for one jiffy. In addition > > >> to this, srcu_get_delay() call in srcu_gp_end(), which is used to calculate > > >> the delay used for scheduling callbacks, is replaced with the check for > > >> expedited grace period. This is done, to schedule cbs for completed expedited > > >> grace periods immediately, which results in improved boot time seen in > > >> experiments. > > >> > > >> In addition to the changes to default per phase delays, this change > > >> adds 3 new kernel parameters - srcutree.srcu_max_nodelay, > > >> srcutree.srcu_max_nodelay_phase, srcutree.srcu_retry_check_delay. > > >> This allows users to configure the srcu grace period scanning delays, > > >> depending on their system configuration requirements. > > >> > > >> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> > > > > > > I've given this a go on one of my test platforms (the one I noticed > > > the issue on the first place), and found that the initial part of the > > > EFI boot under KVM (pointlessly wiping the emulated flash) went down > > > to 1m7s from 3m50s (HZ=250). > > > > > > Clearly a massive improvement, but still a far cry from the original > > > ~40s (yes, this box is utter crap -- which is why I use it). > > > > Do you see any improvement by using "srcutree.srcu_max_nodelay=1000" > > bootarg, on top of this patch? > > Yup, this brings it back to 43s on a quick test run, which is close > enough to what I had before. > > How does a random user come up with such a value though? There was some talk of moving from synchronize_srcu_expedited() to call_srcu() with the occasional srcu_barrier() to avoid OOM. If that proves to be practical, that should get decent performance with little tuning. But in the meantime, we need to avoid hangs due to CPU-bound tasks in one workload while still avoiding massive boot-time slowdowns in your workload. Right now, Neeraj's carefully tuned approach is the one way we know to square this particular circle. Thanx, Paul ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-06-28 13:47 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-06-27 12:37 [PATCH] srcu: Reduce blocking agressiveness of expedited grace periods further Neeraj Upadhyay 2022-06-27 22:45 ` Paul E. McKenney 2022-06-28 3:15 ` Neeraj Upadhyay 2022-06-28 4:26 ` Paul E. McKenney 2022-06-28 2:14 ` Zhangfei Gao 2022-06-28 3:22 ` Neeraj Upadhyay 2022-06-28 4:03 ` Zhangfei Gao 2022-06-28 4:24 ` Paul E. McKenney 2022-06-28 4:26 ` Neeraj Upadhyay 2022-06-28 9:19 ` Neeraj Upadhyay 2022-06-28 9:02 ` Marc Zyngier 2022-06-28 9:17 ` Neeraj Upadhyay 2022-06-28 9:31 ` Marc Zyngier 2022-06-28 10:02 ` Neeraj Upadhyay 2022-06-28 13:46 ` Paul E. McKenney
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.