* Re: [patch 1/2] rcu batch tuning
@ 2006-01-27 19:57 Oleg Nesterov
  2006-01-27 23:42 ` Paul E. McKenney
  2006-01-28 17:08 ` Dipankar Sarma
  0 siblings, 2 replies; 12+ messages in thread
From: Oleg Nesterov @ 2006-01-27 19:57 UTC (permalink / raw)
To: Dipankar Sarma
Cc: linux-kernel, Paul E. McKenney, Andrew Morton, Linus Torvalds
Dipankar Sarma wrote:
>
> +static void force_quiescent_state(struct rcu_data *rdp,
> +			struct rcu_ctrlblk *rcp)
> +{
> +	int cpu;
> +	cpumask_t cpumask;
> +	set_need_resched();
> +	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> +		rdp->last_rs_qlen = rdp->qlen;
> +		/*
> +		 * Don't send IPI to itself. With irqs disabled,
> +		 * rdp->cpu is the current cpu.
> +		 */
> +		cpumask = rcp->cpumask;
> +		cpu_clear(rdp->cpu, cpumask);
> +		for_each_cpu_mask(cpu, cpumask)
> +			smp_send_reschedule(cpu);
> +	}
> +}
> [ ... snip ... ]
>
> @@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
> 	rdp = &__get_cpu_var(rcu_data);
> 	*rdp->nxttail = head;
> 	rdp->nxttail = &head->next;
> -
> -	if (unlikely(++rdp->count > 10000))
> -		set_need_resched();
> -
> +	if (unlikely(++rdp->qlen > qhimark)) {
> +		rdp->blimit = INT_MAX;
> +		force_quiescent_state(rdp, &rcu_ctrlblk);
> +	}
When ->qlen exceeds qhimark for the first time we send a reschedule IPI to
the other CPUs and force_quiescent_state() records ->last_rs_qlen = ->qlen.
But we don't reset ->last_rs_qlen when ->qlen goes to 0. This means that
next time we need ++rdp->qlen > qhimark + rsinterval to force the other CPUs
to pass quiescent state, no?
Also, it seems to me it's better to have 2 counters, one for length(->donelist)
and another for length(->curlist + ->nxtlist). I think we don't need
force_quiescent_state() when all rcu callbacks are placed in ->donelist;
we only need to increase rdp->blimit in this case.
Oleg.
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch 1/2] rcu batch tuning
  2006-01-27 19:57 [patch 1/2] rcu batch tuning Oleg Nesterov
@ 2006-01-27 23:42 ` Paul E. McKenney
  2006-01-28 18:07   ` Oleg Nesterov
  2006-01-28 17:08 ` Dipankar Sarma
  1 sibling, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2006-01-27 23:42 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Dipankar Sarma, linux-kernel, Andrew Morton, Linus Torvalds

On Fri, Jan 27, 2006 at 10:57:59PM +0300, Oleg Nesterov wrote:
> Dipankar Sarma wrote:
> >
> > +static void force_quiescent_state(struct rcu_data *rdp,
> > +			struct rcu_ctrlblk *rcp)
> > +{
> > +	int cpu;
> > +	cpumask_t cpumask;
> > +	set_need_resched();
> > +	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> > +		rdp->last_rs_qlen = rdp->qlen;
> > +		/*
> > +		 * Don't send IPI to itself. With irqs disabled,
> > +		 * rdp->cpu is the current cpu.
> > +		 */
> > +		cpumask = rcp->cpumask;
> > +		cpu_clear(rdp->cpu, cpumask);
> > +		for_each_cpu_mask(cpu, cpumask)
> > +			smp_send_reschedule(cpu);
> > +	}
> > +}
> > [ ... snip ... ]
> >
> > @@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
> > 	rdp = &__get_cpu_var(rcu_data);
> > 	*rdp->nxttail = head;
> > 	rdp->nxttail = &head->next;
> > -
> > -	if (unlikely(++rdp->count > 10000))
> > -		set_need_resched();
> > -
> > +	if (unlikely(++rdp->qlen > qhimark)) {
> > +		rdp->blimit = INT_MAX;
> > +		force_quiescent_state(rdp, &rcu_ctrlblk);
> > +	}
>
> When ->qlen exceeds qhimark for the first time we send reschedule IPI to
> other CPUs and force_quiescent_state() records ->last_rs_qlen = ->qlen.
> But we don't reset ->last_rs_qlen when ->qlen goes to 0, this means that
> next time we need ++rdp->qlen > qhimark + rsinterval to force other CPUs
> to pass quiescent state, no?

Good catch -- this could well explain Lee's continuing to hit latency
problems.  Although this would not cause the first latency event, only
subsequent ones, it seems to me that ->last_rs_qlen should be reset
whenever ->blimit is reset.

Dipankar, thoughts?

> Also, it seems to me it's better to have 2 counters, one for length(->donelist)
> and another for length(->curlist + ->nxtlist). I think we don't need
> force_quiescent_state() when all rcu callbacks are placed in ->donelist,
> we only need to increase rdp->blimit in this case.

True, currently the patch keeps the sum of the length of all three lists,
and takes both actions when the sum gets too large.  But the only way
you would get unneeded IPIs would be if callback processing was
stalled, but callback generation and grace-period processing was
still proceeding.  Seems at first glance to be an unusual corner
case, with the only downside being some extra IPIs.  Or am I missing
some aspect?

						Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [patch 1/2] rcu batch tuning
  2006-01-27 23:42 ` Paul E. McKenney
@ 2006-01-28 18:07   ` Oleg Nesterov
  2006-01-30  3:30     ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2006-01-28 18:07 UTC (permalink / raw)
To: paulmck; +Cc: Dipankar Sarma, linux-kernel, Andrew Morton, Linus Torvalds

"Paul E. McKenney" wrote:
>
> On Fri, Jan 27, 2006 at 10:57:59PM +0300, Oleg Nesterov wrote:
> >
> > When ->qlen exceeds qhimark for the first time we send reschedule IPI to
> > other CPUs and force_quiescent_state() records ->last_rs_qlen = ->qlen.
> > But we don't reset ->last_rs_qlen when ->qlen goes to 0, this means that
> > next time we need ++rdp->qlen > qhimark + rsinterval to force other CPUs
> > to pass quiescent state, no?
>
> Good catch -- this could well explain Lee's continuing to hit
> latency problems.  Although this would not cause the first
> latency event, only subsequent ones, it seems to me that ->last_rs_qlen
> should be reset whenever ->blimit is reset.

Maybe it's better to do it another way?

	struct rcu_ctrlblk {
		...
		int	signaled;
		...
	};

	void force_quiescent_state(rdp, rcp)
	{
		if (!rcp->signaled) {	// racy, but tolerable
			rcp->signaled = 1;

			for_each_cpu_mask(cpu, cpumask)
				smp_send_reschedule(cpu);
		}
	}

	void rcu_start_batch(rcp, rdp)
	{
		if (->next_pending && ->completed == ->cur) {
			...
			rcp->signaled = 0;
			...
		}
	}

Probably it also makes sense to tasklet_schedule(rcu_tasklet)
in call_rcu() when ++rdp->qlen > qhimark; this way we can detect
that we need to start the next batch earlier.

> > Also, it seems to me it's better to have 2 counters, one for length(->donelist)
> > and another for length(->curlist + ->nxtlist). I think we don't need
> > force_quiescent_state() when all rcu callbacks are placed in ->donelist,
> > we only need to increase rdp->blimit in this case.
>
> True, currently the patch keeps the sum of the length of all three lists,
> and takes both actions when the sum gets too large.  But the only way
> you would get unneeded IPIs would be if callback processing was
> stalled, but callback generation and grace-period processing was
> still proceeding.  Seems at first glance to be an unusual corner
> case, with the only downside being some extra IPIs.  Or am I missing
> some aspect?

Yes, it is probably not worth complicating the code.

Oleg.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [patch 1/2] rcu batch tuning
  2006-01-28 18:07   ` Oleg Nesterov
@ 2006-01-30  3:30     ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2006-01-30  3:30 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Dipankar Sarma, linux-kernel, Andrew Morton, Linus Torvalds

On Sat, Jan 28, 2006 at 09:07:13PM +0300, Oleg Nesterov wrote:
> "Paul E. McKenney" wrote:
> >
> > On Fri, Jan 27, 2006 at 10:57:59PM +0300, Oleg Nesterov wrote:
> > >
> > > When ->qlen exceeds qhimark for the first time we send reschedule IPI to
> > > other CPUs and force_quiescent_state() records ->last_rs_qlen = ->qlen.
> > > But we don't reset ->last_rs_qlen when ->qlen goes to 0, this means that
> > > next time we need ++rdp->qlen > qhimark + rsinterval to force other CPUs
> > > to pass quiescent state, no?
> >
> > Good catch -- this could well explain Lee's continuing to hit
> > latency problems.  Although this would not cause the first
> > latency event, only subsequent ones, it seems to me that ->last_rs_qlen
> > should be reset whenever ->blimit is reset.
>
> May be it's better to do it in other way?
>
> 	struct rcu_ctrlblk {
> 		...
> 		int	signaled;
> 		...
> 	};
>
> 	void force_quiescent_state(rdp, rcp)
> 	{
> 		if (!rcp->signaled) {	// racy, but tolerable
> 			rcp->signaled = 1;
>
> 			for_each_cpu_mask(cpu, cpumask)
> 				smp_send_reschedule(cpu);
> 		}
> 	}
>
> 	void rcu_start_batch(rcp, rdp)
> 	{
> 		if (->next_pending && ->completed == ->cur) {
> 			...
> 			rcp->signaled = 0;
> 			...
> 		}
> 	}

Possibly...  But the best thing would be for you and Dipankar to get
together to work out the best strategy for this.

						Thanx, Paul

> Probably it is also makes sense to tasklet_schedule(rcu_tasklet)
> in call_rcu() when ++rdp->qlen > qhimark, this way we can detect
> that we need to start the next batch earlier.
>
> > > Also, it seems to me it's better to have 2 counters, one for length(->donelist)
> > > and another for length(->curlist + ->nxtlist). I think we don't need
> > > force_quiescent_state() when all rcu callbacks are placed in ->donelist,
> > > we only need to increase rdp->blimit in this case.
> >
> > True, currently the patch keeps the sum of the length of all three lists,
> > and takes both actions when the sum gets too large.  But the only way
> > you would get unneeded IPIs would be if callback processing was
> > stalled, but callback generation and grace-period processing was
> > still proceeding.  Seems at first glance to be an unusual corner
> > case, with the only downside being some extra IPIs.  Or am I missing
> > some aspect?
>
> Yes, it is probably not worth to complicate the code.
>
> Oleg.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [patch 1/2] rcu batch tuning
  2006-01-27 19:57 [patch 1/2] rcu batch tuning Oleg Nesterov
  2006-01-27 23:42 ` Paul E. McKenney
@ 2006-01-28 17:08 ` Dipankar Sarma
  1 sibling, 0 replies; 12+ messages in thread
From: Dipankar Sarma @ 2006-01-28 17:08 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, Paul E. McKenney, Andrew Morton, Linus Torvalds

On Fri, Jan 27, 2006 at 10:57:59PM +0300, Oleg Nesterov wrote:
> Dipankar Sarma wrote:
> >
> > @@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
> > 	rdp = &__get_cpu_var(rcu_data);
> > 	*rdp->nxttail = head;
> > 	rdp->nxttail = &head->next;
> > -
> > -	if (unlikely(++rdp->count > 10000))
> > -		set_need_resched();
> > -
> > +	if (unlikely(++rdp->qlen > qhimark)) {
> > +		rdp->blimit = INT_MAX;
> > +		force_quiescent_state(rdp, &rcu_ctrlblk);
> > +	}
>
> When ->qlen exceeds qhimark for the first time we send reschedule IPI to
> other CPUs and force_quiescent_state() records ->last_rs_qlen = ->qlen.
> But we don't reset ->last_rs_qlen when ->qlen goes to 0, this means that
> next time we need ++rdp->qlen > qhimark + rsinterval to force other CPUs
> to pass quiescent state, no?

Yes. I have fixed it in my code.

> Also, it seems to me it's better to have 2 counters, one for length(->donelist)
> and another for length(->curlist + ->nxtlist). I think we don't need
> force_quiescent_state() when all rcu callbacks are placed in ->donelist,
> we only need to increase rdp->blimit in this case.

We could, but I am not sure that ->qlen isn't a good measure of the
grace period not happening. In fact, in the past, long donelists often
resulted in longer grace periods. So, no harm in forcing a reschedule
and allowing ksoftirqd to relinquish the cpu.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH 0/2] RCU updates
@ 2006-02-17 15:41 Dipankar Sarma
  2006-02-17 15:43 ` [PATCH 1/2] rcu batch tuning Dipankar Sarma
  0 siblings, 1 reply; 12+ messages in thread
From: Dipankar Sarma @ 2006-02-17 15:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paul E.McKenney

Here are the patches that I have been testing that should help some of
the latency and OOM issues (file limit) that we had discussed in the
past. If the patchset looks ok, we should queue them up in -mm for
some testing before merging. I have lightly tested the patchset on
both ppc64 and x86_64 using ltp, dbench etc.

Update since the last time I published - the file counting stuff now
uses percpu_counter.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] rcu batch tuning
  2006-02-17 15:41 [PATCH 0/2] RCU updates Dipankar Sarma
@ 2006-02-17 15:43 ` Dipankar Sarma
  2006-02-17 20:33   ` Dipankar Sarma
  2006-02-18  8:45   ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Dipankar Sarma @ 2006-02-17 15:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paul E.McKenney

This patch adds new tunables for RCU queue and finished batches.
There are two types of controls - number of completed RCU updates
invoked in a batch (blimit) and monitoring for high rate of
incoming RCUs on a cpu (qhimark, qlowmark). By default,
the per-cpu batch limit is set to a small value. If
the input RCU rate exceeds the high watermark, we do two things -
force quiescent state on all cpus and set the batch limit
of the CPU to INTMAX. Setting batch limit to INTMAX forces all
finished RCUs to be processed in one shot. If we have more than
INTMAX RCUs queued up, then we have bigger problems anyway.
Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
---

 include/linux/rcupdate.h |    6 +++
 kernel/rcupdate.c        |   76 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 63 insertions(+), 19 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-batch-tuning include/linux/rcupdate.h
--- linux-2.6.16-rc3-rcu/include/linux/rcupdate.h~rcu-batch-tuning	2006-02-15 16:00:02.000000000 +0530
+++ linux-2.6.16-rc3-rcu-dipankar/include/linux/rcupdate.h	2006-02-15 16:00:02.000000000 +0530
@@ -98,13 +98,17 @@ struct rcu_data {
 	long		batch;		 /* Batch # for current RCU batch */
 	struct rcu_head *nxtlist;
 	struct rcu_head **nxttail;
-	long		count;		 /* # of queued items */
+	long		qlen;		 /* # of queued callbacks */
 	struct rcu_head *curlist;
 	struct rcu_head **curtail;
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
+	long		blimit;		 /* Upper limit on a processed batch */
 	int cpu;
 	struct rcu_head barrier;
+#ifdef CONFIG_SMP
+	long		last_rs_qlen;	 /* qlen during the last resched */
+#endif
 };

 DECLARE_PER_CPU(struct rcu_data, rcu_data);
diff -puN kernel/rcupdate.c~rcu-batch-tuning kernel/rcupdate.c
--- linux-2.6.16-rc3-rcu/kernel/rcupdate.c~rcu-batch-tuning	2006-02-15 16:00:02.000000000 +0530
+++ linux-2.6.16-rc3-rcu-dipankar/kernel/rcupdate.c	2006-02-15 16:00:02.000000000 +0530
@@ -67,7 +67,43 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d

 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int blimit = 10;
+static int qhimark = 10000;
+static int qlowmark = 100;
+#ifdef CONFIG_SMP
+static int rsinterval = 1000;
+#endif
+
+static atomic_t rcu_barrier_cpu_count;
+static struct semaphore rcu_barrier_sema;
+static struct completion rcu_barrier_completion;
+
+#ifdef CONFIG_SMP
+static void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	int cpu;
+	cpumask_t cpumask;
+	set_need_resched();
+	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
+		rdp->last_rs_qlen = rdp->qlen;
+		/*
+		 * Don't send IPI to itself. With irqs disabled,
+		 * rdp->cpu is the current cpu.
+		 */
+		cpumask = rcp->cpumask;
+		cpu_clear(rdp->cpu, cpumask);
+		for_each_cpu_mask(cpu, cpumask)
+			smp_send_reschedule(cpu);
+	}
+}
+#else
+static inline void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	set_need_resched();
+}
+#endif

 /**
  * call_rcu - Queue an RCU callback for invocation after a grace period.
@@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
 	rdp = &__get_cpu_var(rcu_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-
-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
-
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_ctrlblk);
+	}
 	local_irq_restore(flags);
 }

-static atomic_t rcu_barrier_cpu_count;
-static struct semaphore rcu_barrier_sema;
-static struct completion rcu_barrier_completion;
-
 /**
  * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
  * @head: structure to be used for queueing the RCU updates.
@@ -131,12 +163,12 @@ void fastcall call_rcu_bh(struct rcu_hea
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
-/*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *      rcu_do_batch(rdp);
- */
+
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
+	}
+
 	local_irq_restore(flags);
 }

@@ -199,10 +231,12 @@ static void rcu_do_batch(struct rcu_data
 		next = rdp->donelist = list->next;
 		list->func(list);
 		list = next;
-		rdp->count--;
-		if (++count >= maxbatch)
+		rdp->qlen--;
+		if (++count >= rdp->blimit)
 			break;
 	}
+	if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
+		rdp->blimit = blimit;
 	if (!rdp->donelist)
 		rdp->donetail = &rdp->donelist;
 	else
@@ -473,6 +507,7 @@ static void rcu_init_percpu_data(int cpu
 	rdp->quiescbatch = rcp->completed;
 	rdp->qs_pending = 0;
 	rdp->cpu = cpu;
+	rdp->blimit = blimit;
 }

 static void __devinit rcu_online_cpu(int cpu)
@@ -567,7 +602,12 @@ void synchronize_kernel(void)
 	synchronize_rcu();
 }

-module_param(maxbatch, int, 0);
+module_param(blimit, int, 0);
+module_param(qhimark, int, 0);
+module_param(qlowmark, int, 0);
+#ifdef CONFIG_SMP
+module_param(rsinterval, int, 0);
+#endif
 EXPORT_SYMBOL_GPL(rcu_batches_completed);
 EXPORT_SYMBOL(call_rcu);  /* WARNING: GPL-only in April 2006. */
 EXPORT_SYMBOL(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
_

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] rcu batch tuning
  2006-02-17 15:43 ` [PATCH 1/2] rcu batch tuning Dipankar Sarma
@ 2006-02-17 20:33   ` Dipankar Sarma
  2006-02-18  8:45   ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Dipankar Sarma @ 2006-02-17 20:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paul E.McKenney

Andrew,

The earlier version of this patch was against 2.6.16-rc3 and didn't
apply cleanly to 2.6.16-rc3-mm1. This version applies cleanly to
2.6.16-rc3-mm1.

This patch adds new tunables for RCU queue and finished batches.
There are two types of controls - number of completed RCU updates
invoked in a batch (blimit) and monitoring for high rate of
incoming RCUs on a cpu (qhimark, qlowmark). By default,
the per-cpu batch limit is set to a small value. If
the input RCU rate exceeds the high watermark, we do two things -
force quiescent state on all cpus and set the batch limit
of the CPU to INTMAX. Setting batch limit to INTMAX forces all
finished RCUs to be processed in one shot. If we have more than
INTMAX RCUs queued up, then we have bigger problems anyway.
Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
---

 include/linux/rcupdate.h |    6 +++
 kernel/rcupdate.c        |   74 ++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 17 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-batch-tuning include/linux/rcupdate.h
--- linux-2.6.16-rc3-mm1-rcu/include/linux/rcupdate.h~rcu-batch-tuning	2006-02-18 01:18:24.000000000 +0530
+++ linux-2.6.16-rc3-mm1-rcu-dipankar/include/linux/rcupdate.h	2006-02-18 01:18:24.000000000 +0530
@@ -98,13 +98,17 @@ struct rcu_data {
 	long		batch;		 /* Batch # for current RCU batch */
 	struct rcu_head *nxtlist;
 	struct rcu_head **nxttail;
-	long		count;		 /* # of queued items */
+	long		qlen;		 /* # of queued callbacks */
 	struct rcu_head *curlist;
 	struct rcu_head **curtail;
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
+	long		blimit;		 /* Upper limit on a processed batch */
 	int cpu;
 	struct rcu_head barrier;
+#ifdef CONFIG_SMP
+	long		last_rs_qlen;	 /* qlen during the last resched */
+#endif
 };

 DECLARE_PER_CPU(struct rcu_data, rcu_data);
diff -puN kernel/rcupdate.c~rcu-batch-tuning kernel/rcupdate.c
--- linux-2.6.16-rc3-mm1-rcu/kernel/rcupdate.c~rcu-batch-tuning	2006-02-18 01:18:24.000000000 +0530
+++ linux-2.6.16-rc3-mm1-rcu-dipankar/kernel/rcupdate.c	2006-02-18 01:24:07.000000000 +0530
@@ -68,7 +68,43 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d

 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int blimit = 10;
+static int qhimark = 10000;
+static int qlowmark = 100;
+#ifdef CONFIG_SMP
+static int rsinterval = 1000;
+#endif
+
+static atomic_t rcu_barrier_cpu_count;
+static DEFINE_MUTEX(rcu_barrier_mutex);
+static struct completion rcu_barrier_completion;
+
+#ifdef CONFIG_SMP
+static void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	int cpu;
+	cpumask_t cpumask;
+	set_need_resched();
+	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
+		rdp->last_rs_qlen = rdp->qlen;
+		/*
+		 * Don't send IPI to itself. With irqs disabled,
+		 * rdp->cpu is the current cpu.
+		 */
+		cpumask = rcp->cpumask;
+		cpu_clear(rdp->cpu, cpumask);
+		for_each_cpu_mask(cpu, cpumask)
+			smp_send_reschedule(cpu);
+	}
+}
+#else
+static inline void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	set_need_resched();
+}
+#endif

 /**
  * call_rcu - Queue an RCU callback for invocation after a grace period.
@@ -94,16 +130,14 @@ void fastcall call_rcu(struct rcu_head *
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;

-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_ctrlblk);
+	}

 	local_irq_restore(flags);
 }

-static atomic_t rcu_barrier_cpu_count;
-static DEFINE_MUTEX(rcu_barrier_mutex);
-static struct completion rcu_barrier_completion;
-
 /**
  * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
  * @head: structure to be used for queueing the RCU updates.
@@ -132,12 +166,12 @@ void fastcall call_rcu_bh(struct rcu_hea
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
-/*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *      rcu_do_batch(rdp);
- */
+
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
+	}
+
 	local_irq_restore(flags);
 }

@@ -200,10 +234,12 @@ static void rcu_do_batch(struct rcu_data
 		next = rdp->donelist = list->next;
 		list->func(list);
 		list = next;
-		rdp->count--;
-		if (++count >= maxbatch)
+		rdp->qlen--;
+		if (++count >= rdp->blimit)
 			break;
 	}
+	if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
+		rdp->blimit = blimit;
 	if (!rdp->donelist)
 		rdp->donetail = &rdp->donelist;
 	else
@@ -473,6 +509,7 @@ static void rcu_init_percpu_data(int cpu
 	rdp->quiescbatch = rcp->completed;
 	rdp->qs_pending = 0;
 	rdp->cpu = cpu;
+	rdp->blimit = blimit;
 }

 static void __devinit rcu_online_cpu(int cpu)
@@ -566,7 +603,12 @@ void synchronize_kernel(void)
 	synchronize_rcu();
 }

-module_param(maxbatch, int, 0);
+module_param(blimit, int, 0);
+module_param(qhimark, int, 0);
+module_param(qlowmark, int, 0);
+#ifdef CONFIG_SMP
+module_param(rsinterval, int, 0);
+#endif
 EXPORT_SYMBOL_GPL(rcu_batches_completed);
 EXPORT_SYMBOL_GPL_FUTURE(call_rcu);  /* WARNING: GPL-only in April 2006. */
 EXPORT_SYMBOL_GPL_FUTURE(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
_

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] rcu batch tuning
  2006-02-17 15:43 ` [PATCH 1/2] rcu batch tuning Dipankar Sarma
  2006-02-17 20:33   ` Dipankar Sarma
@ 2006-02-18  8:45   ` Andrew Morton
  2006-02-18  9:15     ` Dipankar Sarma
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2006-02-18  8:45 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, paulmck

Dipankar Sarma <dipankar@in.ibm.com> wrote:
>
> -module_param(maxbatch, int, 0);
> +module_param(blimit, int, 0);
> +module_param(qhimark, int, 0);
> +module_param(qlowmark, int, 0);
> +#ifdef CONFIG_SMP
> +module_param(rsinterval, int, 0);
> +#endif

It's a bit unusual to add boot-time tunables via module_param, but there's
no law against it.

But you do get arrested for not adding them to
Documentation/kernel-parameters.txt.  That's if you think they're permanent
(I hope they aren't).  If they are, they'll probably need a more extensive
description than kernel-parameters.txt entries normally provide.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] rcu batch tuning
  2006-02-18  8:45   ` Andrew Morton
@ 2006-02-18  9:15     ` Dipankar Sarma
  0 siblings, 0 replies; 12+ messages in thread
From: Dipankar Sarma @ 2006-02-18  9:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, paulmck

On Sat, Feb 18, 2006 at 12:45:51AM -0800, Andrew Morton wrote:
> Dipankar Sarma <dipankar@in.ibm.com> wrote:
> >
> > -module_param(maxbatch, int, 0);
> > +module_param(blimit, int, 0);
> > +module_param(qhimark, int, 0);
> > +module_param(qlowmark, int, 0);
> > +#ifdef CONFIG_SMP
> > +module_param(rsinterval, int, 0);
> > +#endif
>
> It's a bit unusual to add boot-time tunables via module_param, but there's
> no law against it.
>
> But you do get arrested for not adding them to
> Documentation/kernel-parameters.txt.  That's if you think they're permanent
> (I hope they aren't).  If they are, they'll probably need a more extensive
> description than kernel-parameters.txt entries normally provide.

I hope that we will not need that many tunables eventually. But my
theory has been that widespread use of, and experimentation with, the
tunables in the event of OOM and latency problems will allow us to
figure out what kind of automatic tuning really works. Regardless, I
think there should be documentation. I will send a documentation patch.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [patch 0/2] RCU: fix various latency/oom issues
@ 2006-01-26 18:40 Dipankar Sarma
  2006-01-26 18:41 ` [patch 1/2] rcu batch tuning Dipankar Sarma
  0 siblings, 1 reply; 12+ messages in thread
From: Dipankar Sarma @ 2006-01-26 18:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, Paul E.McKenney, linux-kernel

Here are the patches that I have been testing that should help some of
the latency and OOM issues (file limit) that we had discussed in the
past. If the patchset looks ok, we should probably queue them up in -mm
for some testing before merging. I have lightly tested the patchset on
both ppc64 and x86_64 using ltp, dbench etc.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [patch 1/2] rcu batch tuning
  2006-01-26 18:40 [patch 0/2] RCU: fix various latency/oom issues Dipankar Sarma
@ 2006-01-26 18:41 ` Dipankar Sarma
  2006-01-26 19:33   ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Dipankar Sarma @ 2006-01-26 18:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, Paul E.McKenney, linux-kernel

This patch adds new tunables for RCU queue and finished batches.
There are two types of controls - number of completed RCU updates
invoked in a batch (blimit) and monitoring for high rate of
incoming RCUs on a cpu (qhimark, qlowmark). By default,
the per-cpu batch limit is set to a small value. If
the input RCU rate exceeds the high watermark, we do two things -
force quiescent state on all cpus and set the batch limit
of the CPU to INTMAX. Setting batch limit to INTMAX forces all
finished RCUs to be processed in one shot. If we have more than
INTMAX RCUs queued up, then we have bigger problems anyway.
Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
---

 include/linux/rcupdate.h |    6 +++
 kernel/rcupdate.c        |   76 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 63 insertions(+), 19 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-batch-tuning include/linux/rcupdate.h
--- linux-2.6.16-rc1-rcu/include/linux/rcupdate.h~rcu-batch-tuning	2006-01-25 00:09:54.000000000 +0530
+++ linux-2.6.16-rc1-rcu-dipankar/include/linux/rcupdate.h	2006-01-25 01:07:39.000000000 +0530
@@ -98,13 +98,17 @@ struct rcu_data {
 	long		batch;		 /* Batch # for current RCU batch */
 	struct rcu_head *nxtlist;
 	struct rcu_head **nxttail;
-	long		count;		 /* # of queued items */
+	long		qlen;		 /* # of queued callbacks */
 	struct rcu_head *curlist;
 	struct rcu_head **curtail;
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
+	long		blimit;		 /* Upper limit on a processed batch */
 	int cpu;
 	struct rcu_head barrier;
+#ifdef CONFIG_SMP
+	long		last_rs_qlen;	 /* qlen during the last resched */
+#endif
 };

 DECLARE_PER_CPU(struct rcu_data, rcu_data);
diff -puN kernel/rcupdate.c~rcu-batch-tuning kernel/rcupdate.c
--- linux-2.6.16-rc1-rcu/kernel/rcupdate.c~rcu-batch-tuning	2006-01-25 00:09:54.000000000 +0530
+++ linux-2.6.16-rc1-rcu-dipankar/kernel/rcupdate.c	2006-01-25 23:08:03.000000000 +0530
@@ -67,7 +67,43 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d

 /* Fake initialization required by compiler */
 static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10000;
+static int blimit = 10;
+static int qhimark = 10000;
+static int qlowmark = 100;
+#ifdef CONFIG_SMP
+static int rsinterval = 1000;
+#endif
+
+static atomic_t rcu_barrier_cpu_count;
+static struct semaphore rcu_barrier_sema;
+static struct completion rcu_barrier_completion;
+
+#ifdef CONFIG_SMP
+static void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	int cpu;
+	cpumask_t cpumask;
+	set_need_resched();
+	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
+		rdp->last_rs_qlen = rdp->qlen;
+		/*
+		 * Don't send IPI to itself. With irqs disabled,
+		 * rdp->cpu is the current cpu.
+		 */
+		cpumask = rcp->cpumask;
+		cpu_clear(rdp->cpu, cpumask);
+		for_each_cpu_mask(cpu, cpumask)
+			smp_send_reschedule(cpu);
+	}
+}
+#else
+static inline void force_quiescent_state(struct rcu_data *rdp,
+			struct rcu_ctrlblk *rcp)
+{
+	set_need_resched();
+}
+#endif

 /**
  * call_rcu - Queue an RCU callback for invocation after a grace period.
@@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
 	rdp = &__get_cpu_var(rcu_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-
-	if (unlikely(++rdp->count > 10000))
-		set_need_resched();
-
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_ctrlblk);
+	}
 	local_irq_restore(flags);
 }

-static atomic_t rcu_barrier_cpu_count;
-static struct semaphore rcu_barrier_sema;
-static struct completion rcu_barrier_completion;
-
 /**
  * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
  * @head: structure to be used for queueing the RCU updates.
@@ -131,12 +163,12 @@ void fastcall call_rcu_bh(struct rcu_hea
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
-	rdp->count++;
-/*
- *  Should we directly call rcu_do_batch() here ?
- *  if (unlikely(rdp->count > 10000))
- *      rcu_do_batch(rdp);
- */
+
+	if (unlikely(++rdp->qlen > qhimark)) {
+		rdp->blimit = INT_MAX;
+		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
+	}
+
 	local_irq_restore(flags);
 }

@@ -199,10 +231,12 @@ static void rcu_do_batch(struct rcu_data
 		next = rdp->donelist = list->next;
 		list->func(list);
 		list = next;
-		rdp->count--;
-		if (++count >= maxbatch)
+		rdp->qlen--;
+		if (++count >= rdp->blimit)
 			break;
 	}
+	if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
+		rdp->blimit = blimit;
 	if (!rdp->donelist)
 		rdp->donetail = &rdp->donelist;
 	else
@@ -473,6 +507,7 @@ static void rcu_init_percpu_data(int cpu
 	rdp->quiescbatch = rcp->completed;
 	rdp->qs_pending = 0;
 	rdp->cpu = cpu;
+	rdp->blimit = blimit;
 }

 static void __devinit rcu_online_cpu(int cpu)
@@ -567,7 +602,12 @@ void synchronize_kernel(void)
 	synchronize_rcu();
 }

-module_param(maxbatch, int, 0);
+module_param(blimit, int, 0);
+module_param(qhimark, int, 0);
+module_param(qlowmark, int, 0);
+#ifdef CONFIG_SMP
+module_param(rsinterval, int, 0);
+#endif
 EXPORT_SYMBOL_GPL(rcu_batches_completed);
 EXPORT_SYMBOL(call_rcu);  /* WARNING: GPL-only in April 2006. */
 EXPORT_SYMBOL(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
_

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [patch 1/2] rcu batch tuning
  2006-01-26 18:41 ` [patch 1/2] rcu batch tuning Dipankar Sarma
@ 2006-01-26 19:33   ` Paul E. McKenney
  2006-01-26 19:42     ` Dipankar Sarma
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2006-01-26 19:33 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On Fri, Jan 27, 2006 at 12:11:27AM +0530, Dipankar Sarma wrote:
> 
> This patch adds new tunables for RCU queue and finished batches.
> There are two types of controls - number of completed RCU updates
> invoked in a batch (blimit) and monitoring for high rate of
> incoming RCUs on a cpu (qhimark, qlowmark). By default,
> the per-cpu batch limit is set to a small value. If
> the input RCU rate exceeds the high watermark, we do two things -
> force quiescent state on all cpus and set the batch limit
> of the CPU to INT_MAX. Setting the batch limit to INT_MAX forces all
> finished RCUs to be processed in one shot. If we have more than
> INT_MAX RCUs queued up, then we have bigger problems anyway.
> Once the incoming queued RCUs fall below the low watermark, the
> batch limit is set back to the default.

Looks good to me!  We might have to have more sophisticated adjustment
of blimit, but starting simple is definitely the right way to go.
						Thanx, Paul

Acked-by: <paulmck@us.ibm.com>

> Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
> ---
> 
> 
>  include/linux/rcupdate.h |    6 +++
>  kernel/rcupdate.c        |   76 +++++++++++++++++++++++++++++++++++------------
>  2 files changed, 63 insertions(+), 19 deletions(-)
> 
> diff -puN include/linux/rcupdate.h~rcu-batch-tuning include/linux/rcupdate.h
> --- linux-2.6.16-rc1-rcu/include/linux/rcupdate.h~rcu-batch-tuning	2006-01-25 00:09:54.000000000 +0530
> +++ linux-2.6.16-rc1-rcu-dipankar/include/linux/rcupdate.h	2006-01-25 01:07:39.000000000 +0530
> @@ -98,13 +98,17 @@ struct rcu_data {
> 	long		batch;		/* Batch # for current RCU batch */
> 	struct rcu_head *nxtlist;
> 	struct rcu_head **nxttail;
> -	long		count;		/* # of queued items */
> +	long		qlen;		/* # of queued callbacks */
> 	struct rcu_head *curlist;
> 	struct rcu_head **curtail;
> 	struct rcu_head *donelist;
> 	struct rcu_head **donetail;
> +	long		blimit;		/* Upper limit on a processed batch */
> 	int cpu;
> 	struct rcu_head barrier;
> +#ifdef CONFIG_SMP
> +	long		last_rs_qlen;	/* qlen during the last resched */
> +#endif
> };
> 
> DECLARE_PER_CPU(struct rcu_data, rcu_data);
> diff -puN kernel/rcupdate.c~rcu-batch-tuning kernel/rcupdate.c
> --- linux-2.6.16-rc1-rcu/kernel/rcupdate.c~rcu-batch-tuning	2006-01-25 00:09:54.000000000 +0530
> +++ linux-2.6.16-rc1-rcu-dipankar/kernel/rcupdate.c	2006-01-25 23:08:03.000000000 +0530
> @@ -67,7 +67,43 @@ DEFINE_PER_CPU(struct rcu_data, rcu_bh_d
> 
> /* Fake initialization required by compiler */
> static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
> -static int maxbatch = 10000;
> +static int blimit = 10;
> +static int qhimark = 10000;
> +static int qlowmark = 100;
> +#ifdef CONFIG_SMP
> +static int rsinterval = 1000;
> +#endif
> +
> +static atomic_t rcu_barrier_cpu_count;
> +static struct semaphore rcu_barrier_sema;
> +static struct completion rcu_barrier_completion;
> +
> +#ifdef CONFIG_SMP
> +static void force_quiescent_state(struct rcu_data *rdp,
> +			struct rcu_ctrlblk *rcp)
> +{
> +	int cpu;
> +	cpumask_t cpumask;
> +	set_need_resched();
> +	if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> +		rdp->last_rs_qlen = rdp->qlen;
> +		/*
> +		 * Don't send IPI to itself. With irqs disabled,
> +		 * rdp->cpu is the current cpu.
> +		 */
> +		cpumask = rcp->cpumask;
> +		cpu_clear(rdp->cpu, cpumask);
> +		for_each_cpu_mask(cpu, cpumask)
> +			smp_send_reschedule(cpu);
> +	}
> +}
> +#else
> +static inline void force_quiescent_state(struct rcu_data *rdp,
> +			struct rcu_ctrlblk *rcp)
> +{
> +	set_need_resched();
> +}
> +#endif
> 
> /**
>  * call_rcu - Queue an RCU callback for invocation after a grace period.
> @@ -92,17 +128,13 @@ void fastcall call_rcu(struct rcu_head *
> 	rdp = &__get_cpu_var(rcu_data);
> 	*rdp->nxttail = head;
> 	rdp->nxttail = &head->next;
> -
> -	if (unlikely(++rdp->count > 10000))
> -		set_need_resched();
> -
> +	if (unlikely(++rdp->qlen > qhimark)) {
> +		rdp->blimit = INT_MAX;
> +		force_quiescent_state(rdp, &rcu_ctrlblk);
> +	}
> 	local_irq_restore(flags);
> }
> 
> -static atomic_t rcu_barrier_cpu_count;
> -static struct semaphore rcu_barrier_sema;
> -static struct completion rcu_barrier_completion;
> -
> /**
>  * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
>  * @head: structure to be used for queueing the RCU updates.
> @@ -131,12 +163,12 @@ void fastcall call_rcu_bh(struct rcu_hea
> 	rdp = &__get_cpu_var(rcu_bh_data);
> 	*rdp->nxttail = head;
> 	rdp->nxttail = &head->next;
> -	rdp->count++;
> -/*
> - * Should we directly call rcu_do_batch() here ?
> - * if (unlikely(rdp->count > 10000))
> - *	rcu_do_batch(rdp);
> - */
> +
> +	if (unlikely(++rdp->qlen > qhimark)) {
> +		rdp->blimit = INT_MAX;
> +		force_quiescent_state(rdp, &rcu_bh_ctrlblk);
> +	}
> +
> 	local_irq_restore(flags);
> }
> 
> @@ -199,10 +231,12 @@ static void rcu_do_batch(struct rcu_data
> 		next = rdp->donelist = list->next;
> 		list->func(list);
> 		list = next;
> -		rdp->count--;
> -		if (++count >= maxbatch)
> +		rdp->qlen--;
> +		if (++count >= rdp->blimit)
> 			break;
> 	}
> +	if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
> +		rdp->blimit = blimit;
> 	if (!rdp->donelist)
> 		rdp->donetail = &rdp->donelist;
> 	else
> @@ -473,6 +507,7 @@ static void rcu_init_percpu_data(int cpu
> 	rdp->quiescbatch = rcp->completed;
> 	rdp->qs_pending = 0;
> 	rdp->cpu = cpu;
> +	rdp->blimit = blimit;
> }
> 
> static void __devinit rcu_online_cpu(int cpu)
> @@ -567,7 +602,12 @@ void synchronize_kernel(void)
> 	synchronize_rcu();
> }
> 
> -module_param(maxbatch, int, 0);
> +module_param(blimit, int, 0);
> +module_param(qhimark, int, 0);
> +module_param(qlowmark, int, 0);
> +#ifdef CONFIG_SMP
> +module_param(rsinterval, int, 0);
> +#endif
> EXPORT_SYMBOL_GPL(rcu_batches_completed);
> EXPORT_SYMBOL(call_rcu);  /* WARNING: GPL-only in April 2006. */
> EXPORT_SYMBOL(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
> 
> _
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread
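The watermark scheme described in the changelog quoted above boils down to a small per-CPU state machine: cross `qhimark` and batches become unlimited; drain below `qlowmark` and the limit snaps back to the default. A hedged user-space sketch follows -- the `model_` names are invented for illustration, only the default constants come from the patch, and this is not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>

/* Toy model of the blimit/qhimark/qlowmark state machine; constants are
 * the patch's defaults, everything else is illustrative. */
enum { MODEL_BLIMIT = 10, MODEL_QHIMARK = 10000, MODEL_QLOWMARK = 100 };

struct model_cpu {
	long qlen;	/* # of queued callbacks */
	long blimit;	/* upper limit on a processed batch */
};

/* Mirrors the call_rcu() hot path: crossing the high watermark switches
 * this CPU to effectively unlimited batches. */
static void model_enqueue(struct model_cpu *c)
{
	if (++c->qlen > MODEL_QHIMARK)
		c->blimit = INT_MAX;
}

/* Mirrors rcu_do_batch(): invoke at most blimit callbacks, then fall back
 * to the default limit once the queue is at or below the low watermark.
 * Returns the number of callbacks invoked. */
static long model_do_batch(struct model_cpu *c)
{
	long count = 0;

	while (c->qlen > 0 && count < c->blimit) {
		c->qlen--;
		count++;
	}
	if (c->blimit == INT_MAX && c->qlen <= MODEL_QLOWMARK)
		c->blimit = MODEL_BLIMIT;
	return count;
}
```

Enqueueing one callback past the high watermark flips the model's limit to INT_MAX, the next batch drains the whole backlog in one shot, and the limit then returns to the small default -- the behavior the changelog describes.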
* Re: [patch 1/2] rcu batch tuning
  2006-01-26 19:33 ` Paul E. McKenney
@ 2006-01-26 19:42   ` Dipankar Sarma
  0 siblings, 0 replies; 12+ messages in thread
From: Dipankar Sarma @ 2006-01-26 19:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On Thu, Jan 26, 2006 at 11:33:17AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 27, 2006 at 12:11:27AM +0530, Dipankar Sarma wrote:
> > 
> > This patch adds new tunables for RCU queue and finished batches.
> > There are two types of controls - number of completed RCU updates
> > invoked in a batch (blimit) and monitoring for high rate of
> > incoming RCUs on a cpu (qhimark, qlowmark). By default,
> > the per-cpu batch limit is set to a small value. If
> > the input RCU rate exceeds the high watermark, we do two things -
> > force quiescent state on all cpus and set the batch limit
> > of the CPU to INT_MAX. Setting the batch limit to INT_MAX forces all
> > finished RCUs to be processed in one shot. If we have more than
> > INT_MAX RCUs queued up, then we have bigger problems anyway.
> > Once the incoming queued RCUs fall below the low watermark, the
> > batch limit is set back to the default.
> 
> Looks good to me!  We might have to have more sophisticated adjustment
> of blimit, but starting simple is definitely the right way to go.

Yes. Once I am set up to do latency measurements, I will look at
the possible need for more sophisticated adjustment of blimit.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2006-02-18  9:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-27 19:57 [patch 1/2] rcu batch tuning Oleg Nesterov
2006-01-27 23:42 ` Paul E. McKenney
2006-01-28 18:07   ` Oleg Nesterov
2006-01-30  3:30     ` Paul E. McKenney
2006-01-28 17:08 ` Dipankar Sarma
  -- strict thread matches above, loose matches on Subject: below --
2006-02-17 15:41 [PATCH 0/2] RCU updates Dipankar Sarma
2006-02-17 15:43 ` [PATCH 1/2] rcu batch tuning Dipankar Sarma
2006-02-17 20:33   ` Dipankar Sarma
2006-02-18  8:45     ` Andrew Morton
2006-02-18  9:15       ` Dipankar Sarma
2006-01-26 18:40 [patch 0/2] RCU: fix various latency/oom issues Dipankar Sarma
2006-01-26 18:41 ` [patch 1/2] rcu batch tuning Dipankar Sarma
2006-01-26 19:33   ` Paul E. McKenney
2006-01-26 19:42     ` Dipankar Sarma