* [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
@ 2014-03-26 13:34 Alexander Gordeev
2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Gordeev, Kent Overstreet, Jens Axboe, Shaohua Li,
Nicholas Bellinger, Ingo Molnar, Peter Zijlstra
Hello,
This series is against 3.14.0-rc7.
It aims to further improve 'percpu_ida' tag locality by taking
the system's CPU topology into account when stealing tags; that is,
try to steal from the CPU 'closest' to the stealing one.
I would not have bothered posting this, since on several systems the
change did not show any improvement, e.g. on this one:
CPU0 attaching sched-domain:
domain 0: span 0,8 level SIBLING
groups: 0 (cpu_power = 588) 8 (cpu_power = 588)
domain 1: span 0-3,8-11 level MC
groups: 0,8 (cpu_power = 1176) 1,9 (cpu_power = 1176) 2,10 (cpu_power = 1176) 3,11 (cpu_power = 1176)
domain 2: span 0-15 level NUMA
groups: 0-3,8-11 (cpu_power = 4704) 4-7,12-15 (cpu_power = 4704)
But other systems (more dense?) showed a cache-hit rate increase of
up to 20%, e.g. this one:
CPU5 attaching sched-domain:
domain 0: span 0-5 level MC
groups: 5 (cpu_power = 1023) 0 (cpu_power = 1023) 1 (cpu_power = 1023) 2 (cpu_power = 1023) 3 (cpu_power = 1023) 4 (cpu_power = 1023)
domain 1: span 0-7 level NUMA
groups: 0-5 (cpu_power = 6138) 6-7 (cpu_power = 2046)
CPU6 attaching sched-domain:
domain 0: span 6-7 level MC
groups: 6 (cpu_power = 1023) 7 (cpu_power = 1023)
domain 1: span 0-7 level NUMA
groups: 6-7 (cpu_power = 2046) 0-5 (cpu_power = 6138)
I tested using the 'null_blk' device with the number of threads equal
to the number of CPUs, both with each thread affined to one CPU and
with no affinity at all, and saw no difference between the two setups.
Suggestions are welcome :)
Thanks!
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Alexander Gordeev (2):
sched: Introduce topology level masks and for_each_tlm() macro
percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()
 include/linux/percpu_ida.h |    1 -
 include/linux/sched.h      |    5 ++
 kernel/sched/core.c        |   89 ++++++++++++++++++++++++++++++++++++++++++++
 lib/percpu_ida.c           |   46 +++++++++-------------
 4 files changed, 113 insertions(+), 28 deletions(-)
--
1.7.7.6
^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
@ 2014-03-26 13:34 ` Alexander Gordeev
  2014-03-26 13:34 ` [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags() Alexander Gordeev
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel; +Cc: Alexander Gordeev, Ingo Molnar, Peter Zijlstra

We have the for_each_cpu_mask() macro (and friends) to enumerate
CPUs in a given cpumask. Such enumeration walks CPUs in ascending
order of their IDs and does not take into account the CPU topology
of the system. That is, each successive CPU could belong to any
level of the system topology with regard to the previously iterated
CPU.

Yet in some cases such indiscriminate enumeration yields suboptimal
results when for_each_cpu_mask() is used to find the first CPU that
matches certain criteria: if the search prefers a CPU as close as
possible to the current one (in the same core, package, last level
cache etc.) then the for_each_cpu_mask() macro alone is not enough.

To facilitate convenient topology-aware cpumask enumeration, this
update introduces the concept of topology level masks - a per-CPU
array of cpumasks. The n-th item in this array contains the CPUs
on the n-th topology level minus all CPUs on the previous levels
[0..n-1], relative to this CPU.

Additionally, a new macro, for_each_tlm(), is introduced to
enumerate the topology level masks described above. As a result,
a topology-aware enumeration of an arbitrary cpumask 'mask' would
look like this:

	struct cpumask **tlm;
	int cpu;

	for_each_tlm(tlm) {
		for_each_cpu_and(cpu, *tlm, mask) {
			/*
			 * Check if 'cpu' matches criteria
			 */
		}
	}

It is generally recommended to disable preemption when calling the
for_each_tlm() macro. Although that does not guarantee the search
will not migrate to another CPU once preemption is re-enabled, we
do not want it moved in the middle of the nested loops, which would
cause accesses to the original CPU's topology level masks.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |    5 +++
 kernel/sched/core.c   |   89 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..489df53 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2934,4 +2934,9 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+DECLARE_PER_CPU(struct cpumask **, sd_tlm);
+
+#define for_each_tlm(tlm) \
+	for ((tlm) = this_cpu_read(sd_tlm); *(tlm); (tlm)++)
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6edbef2..da6d119 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5451,6 +5451,18 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
 		destroy_sched_domain(sd, cpu);
 }
 
+static void destroy_topology_level_masks(struct cpumask **tlm)
+{
+	if (tlm) {
+		struct cpumask **masks;
+
+		for (masks = tlm; *masks; masks++)
+			kfree(*masks);
+
+		kfree(tlm);
+	}
+}
+
 /*
  * Keep a special pointer to the highest sched_domain that has
  * SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
@@ -5493,6 +5505,67 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+DEFINE_PER_CPU(struct cpumask **, sd_tlm);
+
+static struct cpumask **
+build_topology_level_masks(struct sched_domain *domain, int cpu)
+{
+	struct sched_domain *sd;
+	struct cpumask **ret;
+	struct cpumask *mask;
+	struct cpumask *prev;
+	int node = cpu_to_node(cpu);
+	int ndoms = 0;
+	int level = 0;
+
+	prev = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!prev)
+		return NULL;
+
+	for (sd = domain; sd; sd = sd->parent) {
+		if (sd->parent && cpumask_equal(sched_domain_span(sd->parent),
+						sched_domain_span(sd)))
+			continue;
+		ndoms++;
+	}
+
+	ret = kzalloc_node((ndoms + 1) * sizeof(ret[0]), GFP_KERNEL, node);
+	if (!ret)
+		goto err;
+
+	for (sd = domain; sd; sd = sd->parent) {
+		if (sd->parent && cpumask_equal(sched_domain_span(sd->parent),
+						sched_domain_span(sd)))
+			continue;
+
+		if (cpumask_equal(sched_domain_span(sd), prev))
+			break;
+
+		mask = kzalloc_node(cpumask_size(), GFP_KERNEL, node);
+		if (!mask)
+			goto err;
+
+		cpumask_andnot(mask, sched_domain_span(sd), prev);
+		cpumask_or(prev, prev, sched_domain_span(sd));
+
+		ret[level] = mask;
+		level++;
+	}
+
+	WARN_ON_ONCE(level != ndoms);
+
+err:
+	kfree(prev);
+	return ret;
+}
+
+static void cpu_attach_topology_level_masks(struct cpumask **masks, int cpu)
+{
+	struct cpumask **tmp = per_cpu(sd_tlm, cpu);
+	per_cpu(sd_tlm, cpu) = masks;
+	destroy_topology_level_masks(tmp);
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5569,6 +5642,7 @@ struct sd_data {
 
 struct s_data {
 	struct sched_domain ** __percpu sd;
+	struct cpumask *** __percpu masks;
 	struct root_domain *rd;
 };
 
@@ -5894,6 +5968,7 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 		if (!atomic_read(&d->rd->refcount))
 			free_rootdomain(&d->rd->rcu); /* fall through */
 	case sa_sd:
+		free_percpu(d->masks);
 		free_percpu(d->sd); /* fall through */
 	case sa_sd_storage:
 		__sdt_free(cpu_map); /* fall through */
@@ -5912,6 +5987,9 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
 	d->sd = alloc_percpu(struct sched_domain *);
 	if (!d->sd)
 		return sa_sd_storage;
+	d->masks = alloc_percpu(struct cpumask **);
+	if (!d->masks)
+		return sa_sd;
 	d->rd = alloc_rootdomain();
 	if (!d->rd)
 		return sa_sd;
@@ -6373,6 +6451,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 {
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
+	struct cpumask **masks;
 	struct s_data d;
 	int i, ret = -ENOMEM;
 
@@ -6421,11 +6500,21 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 		}
 	}
 
+	/* Build topology level plain masks for the domains */
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d.sd, i);
+		masks = build_topology_level_masks(sd, i);
+		*per_cpu_ptr(d.masks, i) = masks;
+	}
+
	/* Attach the domains */
	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		sd = *per_cpu_ptr(d.sd, i);
		cpu_attach_domain(sd, d.rd, i);
+
+		masks = *per_cpu_ptr(d.masks, i);
+		cpu_attach_topology_level_masks(masks, i);
	}
	rcu_read_unlock();

-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread
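To make the level-mask construction above concrete, here is a minimal
userspace model of the same andnot/or computation. The 8-CPU topology
(SMT pairs inside two 4-CPU sockets) and all helper names are invented
for the sketch; this is not kernel code:

	#include <stdint.h>
	#include <stdio.h>

	/* Span of each sched-domain level for a given CPU, as a bitmask. */
	static uint32_t smt_span(int cpu)  { return 0x3u << (cpu & ~1); }  /* sibling pair */
	static uint32_t mc_span(int cpu)   { return 0xfu << (cpu & ~3); }  /* 4-CPU socket */
	static uint32_t numa_span(int cpu) { (void)cpu; return 0xff; }     /* whole system */

	int main(void)
	{
		uint32_t (*span[])(int) = { smt_span, mc_span, numa_span };
		uint32_t prev = 0;	/* union of all lower levels so far */
		int cpu = 5;

		for (unsigned int l = 0; l < 3; l++) {
			/* cpumask_andnot(): this level minus all previous levels */
			uint32_t tlm = span[l](cpu) & ~prev;

			/* cpumask_or(): accumulate for the next iteration */
			prev |= span[l](cpu);
			printf("cpu%d level%u tlm %#x\n", cpu, l, (unsigned int)tlm);
		}
		return 0;
	}

For cpu5 this prints the masks 0x30 (its own core), 0xc0 (the rest of
its socket) and 0x0f (the remote socket): exactly the nearest-first
ordering the next patch wants when stealing tags.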
* [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
  2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev
@ 2014-03-26 13:34 ` Alexander Gordeev
  2014-04-22  7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
  2014-04-22 11:56 ` Peter Zijlstra
  3 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Gordeev, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger

Function steal_tags() iterates through the 'cpus_have_tags' cpumask
ignoring the system's CPU topology. That leads to situations where a
newly stolen tag and the data structure(s) associated with it (i.e.
struct request in the block layer) are topology-wise more remote from
the stealing CPU than they could have been had steal_tags() taken the
system's topology into account.

This update makes use of the for_each_tlm() macro and mitigates the
problem described above. As a result, cache misses caused by accesses
from a CPU that stole a tag to that tag's associated data are lessened.

As a side effect, the percpu_ida::cpu_last_stolen variable becomes
superfluous and is eliminated.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
---
 include/linux/percpu_ida.h |    1 -
 lib/percpu_ida.c           |   46 ++++++++++++++++++-------------------
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
index f5cfdd6..0d891f3 100644
--- a/include/linux/percpu_ida.h
+++ b/include/linux/percpu_ida.h
@@ -40,7 +40,6 @@ struct percpu_ida {
 	 * we want to pick a cpu at random.  Cycling through them every
 	 * time we steal is a bit easier and more or less equivalent:
 	 */
-	unsigned			cpu_last_stolen;
 
 	/* For sleeping on allocation failure */
 	wait_queue_head_t		wait;
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..5c51baa 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -63,42 +63,34 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 static inline void steal_tags(struct percpu_ida *pool,
 			      struct percpu_ida_cpu *tags)
 {
-	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
+	struct cpumask **tlm;
+	int cpu;
 
-	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
-	     cpus_have_tags; cpus_have_tags--) {
-		cpu = cpumask_next(cpu, &pool->cpus_have_tags);
+	for_each_tlm(tlm) {
+		for_each_cpu_and(cpu, *tlm, &pool->cpus_have_tags) {
+			cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
 
-		if (cpu >= nr_cpu_ids) {
-			cpu = cpumask_first(&pool->cpus_have_tags);
-			if (cpu >= nr_cpu_ids)
-				BUG();
-		}
+			remote = per_cpu_ptr(pool->tag_cpu, cpu);
+			if (remote == tags)
+				continue;
 
-		pool->cpu_last_stolen = cpu;
-		remote = per_cpu_ptr(pool->tag_cpu, cpu);
+			spin_lock(&remote->lock);
 
-		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
+			if (remote->nr_free) {
+				memcpy(tags->freelist,
+				       remote->freelist,
+				       sizeof(unsigned) * remote->nr_free);
 
-		if (remote == tags)
-			continue;
+				tags->nr_free = remote->nr_free;
+				remote->nr_free = 0;
+			}
 
-		spin_lock(&remote->lock);
+			spin_unlock(&remote->lock);
 
-		if (remote->nr_free) {
-			memcpy(tags->freelist,
-			       remote->freelist,
-			       sizeof(unsigned) * remote->nr_free);
-
-			tags->nr_free = remote->nr_free;
-			remote->nr_free = 0;
+			if (tags->nr_free)
+				return;
 		}
-
-		spin_unlock(&remote->lock);
-
-		if (tags->nr_free)
-			break;
 	}
 }
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread
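A minimal sketch of the steal order the new nested loop produces,
reusing the same invented 8-CPU topology as the earlier model (the real
code also stops as soon as it has stolen tags; this model only prints
the visit order):

	#include <stdint.h>
	#include <stdio.h>

	/* Per-level mask for 'cpu', cf. the toy spans in the earlier model. */
	static uint32_t tlm(int cpu, unsigned int level)
	{
		uint32_t span[3] = { 0x3u << (cpu & ~1), 0xfu << (cpu & ~3), 0xff };
		uint32_t prev = level ? span[level - 1] : 0;	/* spans nest here */

		return span[level] & ~prev;
	}

	int main(void)
	{
		uint32_t cpus_have_tags = 0x43;	/* CPUs 0, 1 and 6 have cached tags */
		int self = 5, cpu;

		for (unsigned int level = 0; level < 3; level++) {
			uint32_t cand = tlm(self, level) & cpus_have_tags;

			for (cpu = 0; cpu < 8; cpu++)	/* stands in for for_each_cpu_and() */
				if (cand & (1u << cpu))
					printf("level %u: steal from cpu%d\n", level, cpu);
		}
		return 0;
	}

Run from cpu5, this visits cpu6 (same socket) before cpu0 and cpu1
(remote socket), whereas plain ascending-ID cycling would typically
reach the remote CPUs 0 and 1 first.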
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev 2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev 2014-03-26 13:34 ` [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags() Alexander Gordeev @ 2014-04-22 7:10 ` Alexander Gordeev 2014-04-22 14:03 ` Jens Axboe 2014-04-22 11:56 ` Peter Zijlstra 3 siblings, 1 reply; 26+ messages in thread From: Alexander Gordeev @ 2014-04-22 7:10 UTC (permalink / raw) To: linux-kernel Cc: Kent Overstreet, Jens Axboe, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: > But other systems (more dense?) showed increased cache-hit rate > up to 20%, i.e. this one: Hello Gentlemen, Any feedback on this? Thanks! -- Regards, Alexander Gordeev agordeev@redhat.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-22 7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev @ 2014-04-22 14:03 ` Jens Axboe 2014-04-22 15:57 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-22 14:03 UTC (permalink / raw) To: Alexander Gordeev, linux-kernel Cc: Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-22 01:10, Alexander Gordeev wrote: > On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >> But other systems (more dense?) showed increased cache-hit rate >> up to 20%, i.e. this one: > > Hello Gentlemen, > > Any feedback on this? Sorry for dropping the ball on this. Improvements wrt when to steal, how much, and from whom are sorely needed in percpu_ida. I'll do a bench with this on a system that currently falls apart with it. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-04-22 14:03 ` Jens Axboe
@ 2014-04-22 15:57 ` Jens Axboe
  2014-04-23  0:53 ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2014-04-22 15:57 UTC (permalink / raw)
To: Alexander Gordeev, linux-kernel
Cc: Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

On 04/22/2014 08:03 AM, Jens Axboe wrote:
> On 2014-04-22 01:10, Alexander Gordeev wrote:
>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>> But other systems (more dense?) showed increased cache-hit rate
>>> up to 20%, i.e. this one:
>>
>> Hello Gentlemen,
>>
>> Any feedback on this?
>
> Sorry for dropping the ball on this. Improvements wrt when to steal, how
> much, and from whom are sorely needed in percpu_ida. I'll do a bench
> with this on a system that currently falls apart with it.

Ran some quick numbers with three kernels:

stock		3.15-rc2
limit		3.15-rc2 + steal limit patch (attached)
limit+ag	3.15-rc2 + steal limit + your topology patch

Two tests were run - the device has an effective queue depth limit of
255, so I ran one test at QD=248 (low) and one at QD=512 (high) to both
exercise near-limit depth and over limit depth. 8 processes were used,
split into two groups. One group would always run on the local node,
the other would be run on the adjacent node (near) or on the far node
(far).

Near + low
----------
		IOPS		sys time
stock		1009.5K		55.78%
limit		1084.4K		54.47%
limit+ag	1058.1K		52.42%

Near + high
-----------
		IOPS		sys time
stock		 949.1K		75.12%
limit		 980.7K		64.74%
limit+ag	1010.1K		70.27%

Far + low
---------
		IOPS		sys time
stock		 600.0K		72.28%
limit		 761.7K		71.17%
limit+ag	 762.5K		74.48%

Far + high
----------
		IOPS		sys time
stock		 465.9K		91.66%
limit		 716.2K		88.68%
limit+ag	 758.0K		91.00%

One huge issue on this box is that it's a 4 socket/node machine, with
32 cores (64 threads). Combined with a 255 queue depth limit, the
percpu caching does not work well. I did not include stock+ag results,
they didn't change things very much for me. We simply have to limit the
stealing first, or we're still going to be hammering on percpu locks.

If we compare the top profiles from stock-far-high and
limit+ag-far-high, it looks pretty scary. Here's the stock one:

-  50,84%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 89,83% percpu_ida_alloc
   + 6,03% mtip_queue_rq
   + 2,90% percpu_ida_free

so 50% of the system time spent acquiring a spinlock, with 90% of that
being percpu ida. The limit+ag variant looks like this:

-  32,93%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 78,35% percpu_ida_alloc
   + 19,49% mtip_queue_rq
   + 1,21% __blk_mq_run_hw_queue

which is still pretty horrid and has plenty of room for improvement. I
think we need to make better decisions on the granularity of the tag
caching. If we ignore thread siblings, that'll double our effective
caching. If that's still not enough, I bet per-node/socket would be a
huge improvement.

-- 
Jens Axboe

[-- Attachment #2: limit-steal.patch --]
[-- Type: text/x-patch, Size: 690 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7a799c4..689bbaf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,6 +109,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 {
 	unsigned int nr_tags, nr_cache;
 	struct blk_mq_tags *tags;
+	unsigned int num_cpus;
 	int ret;
 
 	if (total_tags > BLK_MQ_TAG_MAX) {
@@ -121,7 +122,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 		return NULL;
 
 	nr_tags = total_tags - reserved_tags;
-	nr_cache = nr_tags / num_possible_cpus();
+	num_cpus = min(8U, num_online_cpus());
+	nr_cache = nr_tags / num_cpus;
 
 	if (nr_cache < BLK_MQ_TAG_CACHE_MIN)
 		nr_cache = BLK_MQ_TAG_CACHE_MIN;

^ permalink raw reply related	[flat|nested] 26+ messages in thread
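The arithmetic behind the attached limit, for the box above (255 tags,
64 threads; this assumes the BLK_MQ_TAG_CACHE_MIN floor does not kick
in, and is simple math on the numbers from the mail, not a measurement):

	#include <stdio.h>

	int main(void)
	{
		unsigned int nr_tags = 255, online_cpus = 64;
		/* stock divisor is num_possible_cpus(), here taken as 64 */
		unsigned int stock = nr_tags / online_cpus;
		/* patched divisor is min(8U, num_online_cpus()) */
		unsigned int num_cpus = online_cpus < 8 ? online_cpus : 8;
		unsigned int limit = nr_tags / num_cpus;

		/* prints "stock: 3 tags/cpu, limited: 31 tags/cpu" */
		printf("stock: %u tags/cpu, limited: %u tags/cpu\n", stock, limit);
		return 0;
	}

With only ~3 tags of percpu cache, nearly every allocation misses
locally and falls back to the shared lock; ~31 gives the cache a chance
to work.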
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-22 15:57 ` Jens Axboe @ 2014-04-23 0:53 ` Ming Lei 2014-04-23 1:25 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-23 0:53 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/22/2014 08:03 AM, Jens Axboe wrote: >> On 2014-04-22 01:10, Alexander Gordeev wrote: >>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >>>> But other systems (more dense?) showed increased cache-hit rate >>>> up to 20%, i.e. this one: >>> >>> Hello Gentlemen, >>> >>> Any feedback on this? >> >> Sorry for dropping the ball on this. Improvements wrt when to steal, how >> much, and from whom are sorely needed in percpu_ida. I'll do a bench >> with this on a system that currently falls apart with it. > > Ran some quick numbers with three kernels: > > stock 3.15-rc2 > limit 3.15-rc2 + steal limit patch (attached) I am thinking/working on this sort of improvement too, but my idea is to compute tags->nr_max_cache as below: nr_tags / hctx->max_nr_ctx hctx->max_nr_ctx means the max number of sw queues mapped to the hw queue; it needs to be introduced by this approach, and the value should effectively represent the CPU topology info. It is a bit complicated to compute hctx->max_nr_ctx because we need to take into account CPU hotplug and a possible user-defined mapping callback. If the user-defined mapping callback need not be considered, hctx->max_nr_ctx can be figured out before mapping sw queues in blk_mq_init_queue() by first supposing each CPU is online; once that is done, the map for offline CPUs is cleared and then blk_mq_map_swqueue() is called. In my null_blk test on a quad core SMP VM: - 4 hw queues - timer mode With the above approach, tag allocation from the local CPU improves from 5% to 50% for the boot CPU, and from 30% to 90% for non-boot CPUs. If no one objects to the idea, I'd like to post a patch for review. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
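A sketch of the sizing Ming describes; 'max_nr_ctx' is a field this
approach would add (not an existing blk-mq member), the tag count is
invented, and the helper below guesses it as ceil(nr_cpus /
nr_hw_queues), i.e. the most sw queues that can map to one hw queue:

	#include <stdio.h>

	static unsigned int nr_max_cache(unsigned int nr_tags, unsigned int nr_cpus,
					 unsigned int nr_hw_queues)
	{
		/* most sw queues (CPUs) that can map to a single hw queue */
		unsigned int max_nr_ctx = (nr_cpus + nr_hw_queues - 1) / nr_hw_queues;

		return nr_tags / max_nr_ctx;
	}

	int main(void)
	{
		/* quad core VM, 4 hw queues: each hw queue serves one sw queue,
		 * so its whole tag space can be cached locally */
		printf("%u\n", nr_max_cache(64, 4, 4));	/* 64 */
		/* same VM, single shared hw queue: cache shrinks to a quarter */
		printf("%u\n", nr_max_cache(64, 4, 1));	/* 16 */
		return 0;
	}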
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-23 0:53 ` Ming Lei @ 2014-04-23 1:25 ` Jens Axboe 2014-04-25 9:10 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-23 1:25 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-22 18:53, Ming Lei wrote: > Hi Jens, > > On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <axboe@kernel.dk> wrote: >> On 04/22/2014 08:03 AM, Jens Axboe wrote: >>> On 2014-04-22 01:10, Alexander Gordeev wrote: >>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >>>>> But other systems (more dense?) showed increased cache-hit rate >>>>> up to 20%, i.e. this one: >>>> >>>> Hello Gentlemen, >>>> >>>> Any feedback on this? >>> >>> Sorry for dropping the ball on this. Improvements wrt when to steal, how >>> much, and from whom are sorely needed in percpu_ida. I'll do a bench >>> with this on a system that currently falls apart with it. >> >> Ran some quick numbers with three kernels: >> >> stock 3.15-rc2 >> limit 3.15-rc2 + steal limit patch (attached) > > I am thinking/working on this sort of improving too, but my > idea is to compute tags->nr_max_cache by below: > > nr_tags / hctx->max_nr_ctx > > hctx->max_nr_ctx means the max sw queues mapped to the > hw queue, which need to be introduced in the approach, actually, > the value should represent the CPU topology info. > > It is a bit complicated to compute hctx->max_nr_ctx because > we need to take account into CPU hotplug and probable > user-defined mapping callback. We can always just update the caching info, that's not a big problem. We update the mappings on those events anyway. > If user-defined mapping callback needn't to be considered, the > hctx->max_nr_ctx can be figured out before mapping sw > queue in blk_mq_init_queue() by supposing each CPU is > online first, once it is done, the map for offline CPU is cleared, > then start to call blk_mq_map_swqueue(). I don't see how a user defined mapping would change things a whole lot. It's just another point of updating the cache. Besides, user defined mappings will be mostly (only?) for things like multiqueue, where the caching info would likely remain static over a reconfigure. > In my null_blk test on a quad core SMP VM: > > - 4 hw queue > - timer mode > > With the above approach, tag allocation from local CPU can be > improved from: > > 5% -> 50% for boot CPU > 30% -> 90% for non-boot CPU. > > If no one objects the idea, I'd like to post a patch for review. Sent it out, that can't hurt. I'll take a look at it, and give it a test spin as well. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-23 1:25 ` Jens Axboe @ 2014-04-25 9:10 ` Ming Lei 2014-04-25 21:23 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-25 9:10 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 2014-04-22 18:53, Ming Lei wrote: > > >> In my null_blk test on a quad core SMP VM: >> >> - 4 hw queue >> - timer mode >> >> With the above approach, tag allocation from local CPU can be >> improved from: >> >> 5% -> 50% for boot CPU >> 30% -> 90% for non-boot CPU. >> >> If no one objects the idea, I'd like to post a patch for review. > > Sent it out, that can't hurt. I'll take a look at it, and give it a test spin as well. I have sent out the patch at the link below, and would much appreciate it if you could take a look: marc.info/?l=linux-kernel&m=139826944613225&w=2 Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-25 9:10 ` Ming Lei @ 2014-04-25 21:23 ` Jens Axboe 2014-04-26 0:01 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-25 21:23 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 04/25/2014 03:10 AM, Ming Lei wrote: > Hi Jens, > > On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 2014-04-22 18:53, Ming Lei wrote: >> >> >>> In my null_blk test on a quad core SMP VM: >>> >>> - 4 hw queue >>> - timer mode >>> >>> With the above approach, tag allocation from local CPU can be >>> improved from: >>> >>> 5% -> 50% for boot CPU >>> 30% -> 90% for non-boot CPU. >>> >>> If no one objects the idea, I'd like to post a patch for review. >> >> >> Sent it out, that can't hurt. I'll take a look at it, and give it a test >> spin as well. > > I have sent out the patch in below link, and appreciate much > you will take a look at it. > > marc.info/?l=linux-kernel&m=139826944613225&w=2 Sorry, I did run it the other day. It has little to no effect here, but that's mostly because there's so much other crap going on in there. The most effective way to currently make it work better, is just to ensure the caching pool is of a sane size. I've got an alternative tagging scheme that I think would be useful for the cases where the tag space to cpu ratio isn't big enough. So I think we'll retain percpu_ida for the cases where it can cache enough, and punt to an alternative scheme when not. That doesn't mean we should not improve percpu_ida. There's quite a bit of low hanging fruit in there. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-25 21:23 ` Jens Axboe @ 2014-04-26 0:01 ` Ming Lei 2014-04-26 2:03 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-26 0:01 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/25/2014 03:10 AM, Ming Lei wrote: > > Sorry, I did run it the other day. It has little to no effect here, but > that's mostly because there's so much other crap going on in there. The > most effective way to currently make it work better, is just to ensure > the caching pool is of a sane size. Yes, that is just what the patch is doing, :-) From the percpu_ida view, it is easy to observe that it improves allocation performance. I have several patches to export this information via sysfs for monitoring percpu_ida performance. > > I've got an alternative tagging scheme that I think would be useful for > the cases where the tag space to cpu ratio isn't big enough. So I think > we'll retain percpu_ida for the cases where it can cache enough, and > punt to an alternative scheme when not. OK, care to comment on the patch or the idea of setting percpu cache size as (nr_tags / hctx->nr_ctx)? > > That doesn't mean we should not improve percpu_ida. There's quite a bit > of low hanging fruit in there. IMO percpu_max_size in percpu_ida is very important for performance, and it might need to be adjusted dynamically according to the percpu allocation load, but that is far more complicated to implement. Fixing the parameter before percpu_ida_init() might be the simplest way. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-26 0:01 ` Ming Lei @ 2014-04-26 2:03 ` Jens Axboe 2014-04-29 11:35 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-26 2:03 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-25 18:01, Ming Lei wrote: > Hi Jens, > > On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 04/25/2014 03:10 AM, Ming Lei wrote: >> >> Sorry, I did run it the other day. It has little to no effect here, but >> that's mostly because there's so much other crap going on in there. The >> most effective way to currently make it work better, is just to ensure >> the caching pool is of a sane size. > > Yes, that is just what the patch is doing, :-) But it's not enough. For instance, my test case, it's 255 tags and 64 CPUs. We end up in cross-cpu spinlock nightmare mode. > From percpu_ida view, it is easy to observe it can improve > allocation performance. I have several patches to export > these information by sysfs for monitoring percpu_ida > performance. Sounds good! >> I've got an alternative tagging scheme that I think would be useful for >> the cases where the tag space to cpu ratio isn't big enough. So I think >> we'll retain percpu_ida for the cases where it can cache enough, and >> punt to an alternative scheme when not. > > OK, care to comment on the patch or the idea of setting percpu cache > size as (nr_tags / hctx->nr_ctx)? I think it's a good idea. The problem is that for percpu_ida to be effective, you need a bigger cache than the 3 I'd get above. If that isn't the case, it performs poorly. I'm just not convinced the design can ever work in the realm of realistic queue depths. >> That doesn't mean we should not improve percpu_ida. There's quite a bit >> of low hanging fruit in there. > > IMO percpu_max_size in percpu_ida is very important for the > performance, and it might need to adjust dynamically according > to the percpu allocation loading, but it is far more complicated > to implement. And it might be the simplest way to fix the parameter > before percpu_ida_init(). That's what I did, essentially. Ensuring that the percpu_max_size is at least 8 makes it a whole lot better here. But still slower than a regular simple bitmap, which makes me sad. A fairly straight forward cmpxchg based scheme I tested here is around 20% faster than the bitmap approach on a basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, I can't think of cases where that approach would not be faster than percpu_ida. That means all of SCSI, basically, and the basic block drivers. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
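The cmpxchg scheme itself is not posted in this thread; the following is
one plausible minimal shape of such an allocator, assuming at most 64
tags so the whole map fits in a single word on one cacheline. It is a
guess at the general idea, not the actual code Jens benchmarked, and
ffsll() is a glibc extension:

	#include <stdatomic.h>
	#include <stdint.h>
	#include <strings.h>	/* ffsll() */

	#define NR_TAGS 64

	static _Atomic uint64_t tag_map;	/* set bit = tag in use */

	static int tag_alloc(void)
	{
		uint64_t old = atomic_load_explicit(&tag_map, memory_order_relaxed);
		uint64_t new;
		int bit;

		do {
			bit = ffsll(~old);	/* first clear bit, 1-based */
			if (!bit || bit > NR_TAGS)
				return -1;	/* map full, no free tags */
			new = old | (1ULL << (bit - 1));
			/* on cmpxchg failure 'old' is reloaded and we retry */
		} while (!atomic_compare_exchange_weak(&tag_map, &old, new));

		return bit - 1;
	}

	static void tag_free(int tag)
	{
		atomic_fetch_and(&tag_map, ~(1ULL << tag));
	}

All contention stays on a single cacheline, which is the property Kent
argues for later in the thread for small tag spaces.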
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-26 2:03 ` Jens Axboe @ 2014-04-29 11:35 ` Ming Lei 2014-04-29 21:13 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-29 11:35 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 2014-04-25 18:01, Ming Lei wrote: >> >> Hi Jens, >> >> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>> >>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>> >>> Sorry, I did run it the other day. It has little to no effect here, but >>> that's mostly because there's so much other crap going on in there. The >>> most effective way to currently make it work better, is just to ensure >>> the caching pool is of a sane size. >> Yes, that is just what the patch is doing, :-) > But it's not enough. Yes, the patch is only for cases of multi hw queues with offline CPUs present. > For instance, my test case, it's 255 tags and 64 CPUs. > We end up in cross-cpu spinlock nightmare mode. IMO, the scaling problem for the above case might be caused by either the current percpu_ida design or blk-mq's usage of it. One of the problems in blk-mq is that the 'set->queue_depth' parameter from the driver isn't scalable; maybe it is reasonable to introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' can be computed as below: max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) Another question is whether blk-mq can be improved by computing tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current approach assumes there is parallel I/O activity on each CPU, but I wonder if that is the common case in reality. Suppose there are N (N << online CPUs on a big machine) concurrent I/O streams on some of the CPUs; the percpu cache could then be increased a lot, to (nr_tags / N). > > >> From percpu_ida view, it is easy to observe it can improve >> allocation performance. I have several patches to export >> these information by sysfs for monitoring percpu_ida >> performance. > > > Sounds good! If we need to export percpu_ida allocation/free information via sysfs for monitoring performance, percpu_ida needs a parent kobject, which means it may be better to defer allocating the percpu_ida tags until the sw/hw queues are initialized, like the patch does. > > >>> I've got an alternative tagging scheme that I think would be useful for >>> the cases where the tag space to cpu ratio isn't big enough. So I think >>> we'll retain percpu_ida for the cases where it can cache enough, and >>> punt to an alternative scheme when not. >> >> >> OK, care to comment on the patch or the idea of setting percpu cache >> size as (nr_tags / hctx->nr_ctx)? >> >> >>> That doesn't mean we should not improve percpu_ida. There's quite a bit >>> of low hanging fruit in there. >> >> >> IMO percpu_max_size in percpu_ida is very important for the >> performance, and it might need to adjust dynamically according >> to the percpu allocation loading, but it is far more complicated >> to implement. And it might be the simplest way to fix the parameter >> before percpu_ida_init(). 
> > That's what I did, essentially. Ensuring that the percpu_max_size is at > least 8 makes it a whole lot better here. But still slower than a regular > simple bitmap, which makes me sad. A fairly straight forward cmpxchg based > scheme I tested here is around 20% faster than the bitmap approach on a > basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, > I can't think of cases where that approach would not be faster than > percpu_ida. That means all of SCSI, basically, and the basic block drivers. If percpu_ida wants to beat bitmap allocation, the local cache hit ratio has to stay high; in my tests that can be achieved with a large enough local cache. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 11:35 ` Ming Lei @ 2014-04-29 21:13 ` Jens Axboe 2014-04-30 9:40 ` Ming Lei 2014-05-01 22:47 ` Kent Overstreet 0 siblings, 2 replies; 26+ messages in thread From: Jens Axboe @ 2014-04-29 21:13 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 04/29/2014 05:35 AM, Ming Lei wrote: > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 2014-04-25 18:01, Ming Lei wrote: >>> >>> Hi Jens, >>> >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>> >>>> Sorry, I did run it the other day. It has little to no effect here, but >>>> that's mostly because there's so much other crap going on in there. The >>>> most effective way to currently make it work better, is just to ensure >>>> the caching pool is of a sane size. >>> >>> >>> Yes, that is just what the patch is doing, :-) >> >> >> But it's not enough. > > Yes, the patch is only for cases of multi hw queues > with offline CPUs present. > >> For instance, my test case, it's 255 tags and 64 CPUs. >> We end up in cross-cpu spinlock nightmare mode. > > IMO, the scaling problem for the above case might be > caused by either the current percpu_ida design or blk-mq's > usage of it. That is pretty much my claim, yes. Basically I don't think per-cpu tag caching is ever going to be the best solution for the combination of modern machines and the hardware that is out there (limited tags). > One of the problems in blk-mq is that the 'set->queue_depth' > parameter from the driver isn't scalable; maybe it is reasonable to > introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' > can be computed as below: > > max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) > > Another question is whether blk-mq can be improved by computing > tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current > approach assumes there is parallel I/O > activity on each CPU, but I wonder if that is the common > case in reality. Suppose there are N (N << online CPUs on a > big machine) concurrent I/O streams on some of the CPUs; the percpu cache > could then be increased a lot, to (nr_tags / N). That would certainly help the common case, but it'd still be slow for the cases where you DO have IO from lots of sources. If we consider 8-16 tags the minimum for balanced performance, then that doesn't take a whole lot of CPUs to spread out the tag space. Just looking at a case today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the "bigger" case of the Micron card, you still only have 255 active tags. And we probably want to split that up into groups of 32, making the problem even worse. >> That's what I did, essentially. Ensuring that the percpu_max_size is at >> least 8 makes it a whole lot better here. But still slower than a regular >> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based >> scheme I tested here is around 20% faster than the bitmap approach on a >> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, >> I can't think of cases where that approach would not be faster than >> percpu_ida. That means all of SCSI, basically, and the basic block drivers. > > If percpu_ida wants to beat bitmap allocation, the local cache hit > ratio has to stay high; in my tests that can be achieved with a large > enough local cache. 
Yes, that is exactly the issue, local cache hit must be high, and you pretty much need a higher local cache count for that. And therein lies the problem, you can't get that high local cache size for most common cases. With enough tags we could, but that's not what most people will run. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 21:13 ` Jens Axboe @ 2014-04-30 9:40 ` Ming Lei 0 siblings, 0 replies; 26+ messages in thread From: Ming Lei @ 2014-04-30 9:40 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/29/2014 05:35 AM, Ming Lei wrote: >> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>> On 2014-04-25 18:01, Ming Lei wrote: >>>> >>>> Hi Jens, >>>> >>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>> >>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>> >>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>> that's mostly because there's so much other crap going on in there. The >>>>> most effective way to currently make it work better, is just to ensure >>>>> the caching pool is of a sane size. >>>> >>>> >>>> Yes, that is just what the patch is doing, :-) >>> >>> >>> But it's not enough. >> >> Yes, the patch is only for cases of multi hw queues >> with offline CPUs present. >> >>> For instance, my test case, it's 255 tags and 64 CPUs. >>> We end up in cross-cpu spinlock nightmare mode. >> >> IMO, the scaling problem for the above case might be >> caused by either the current percpu_ida design or blk-mq's >> usage of it. > > That is pretty much my claim, yes. Basically I don't think per-cpu tag > caching is ever going to be the best solution for the combination of > modern machines and the hardware that is out there (limited tags). > >> One of the problems in blk-mq is that the 'set->queue_depth' >> parameter from the driver isn't scalable; maybe it is reasonable to >> introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' >> can be computed as below: >> >> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) >> >> Another question is whether blk-mq can be improved by computing >> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current >> approach assumes there is parallel I/O >> activity on each CPU, but I wonder if that is the common >> case in reality. Suppose there are N (N << online CPUs on a >> big machine) concurrent I/O streams on some of the CPUs; the percpu cache >> could then be increased a lot, to (nr_tags / N). > > That would certainly help the common case, but it'd still be slow for > the cases where you DO have IO from lots of sources. It may be difficult to figure out an efficient solution for the unusual case. > If we consider 8-16 tags the minimum for balanced performance, then that > doesn't take a whole lot of CPUs to spread out the tag space. Just looking at a case > today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the > "bigger" case of the Micron card, you still only have 255 active tags. > And we probably want to split that up into groups of 32, making the > problem even worse. Yes, there is a contradiction between having limited hardware queue tags and requiring more local CPU cache. But it is really a challenge to figure out an efficient approach in the case where lots of CPUs contend for very limited resources. Maybe blk-mq can learn from network devices: move the hw queue_depth constraint into a lower level (such as blk_mq_run_hw_queue()) and keep adequate tags in the percpu pool, which means the pool's nr_tags can be much bigger than queue_depth, keeping enough percpu cache. 
When the hw queue is full, congestion control can be applied in blk_mq_alloc_request() to avoid the cross-cpu spinlock nightmare in percpu allocation/free. But if the device requires tags to be less than queue_depth, more work is needed for this approach. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
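A toy illustration of the decoupling idea above: tags come from a pool
larger than the hardware queue depth, and the depth itself is enforced
by a separate in-flight counter at dispatch time. This is a pure sketch
of the suggestion, not blk-mq code, and the constant is made up:

	#include <stdatomic.h>
	#include <stdbool.h>

	#define HW_QUEUE_DEPTH 255	/* the device's real tag limit */

	static atomic_int inflight;

	/* Called before handing a request to hardware. */
	static bool hw_can_dispatch(void)
	{
		if (atomic_fetch_add(&inflight, 1) < HW_QUEUE_DEPTH)
			return true;
		atomic_fetch_sub(&inflight, 1);	/* over depth: congested, back off */
		return false;
	}

	/* Called on completion. */
	static void hw_complete(void)
	{
		atomic_fetch_sub(&inflight, 1);
	}

As noted, if the device requires the tag value itself to be below
queue_depth, this simple split is not enough.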
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 21:13 ` Jens Axboe 2014-04-30 9:40 ` Ming Lei @ 2014-05-01 22:47 ` Kent Overstreet 2014-05-02 2:19 ` Jens Axboe 1 sibling, 1 reply; 26+ messages in thread From: Kent Overstreet @ 2014-05-01 22:47 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: > On 04/29/2014 05:35 AM, Ming Lei wrote: > > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > >> On 2014-04-25 18:01, Ming Lei wrote: > >>> > >>> Hi Jens, > >>> > >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>> > >>>> On 04/25/2014 03:10 AM, Ming Lei wrote: > >>>> > >>>> Sorry, I did run it the other day. It has little to no effect here, but > >>>> that's mostly because there's so much other crap going on in there. The > >>>> most effective way to currently make it work better, is just to ensure > >>>> the caching pool is of a sane size. > >>> > >>> > >>> Yes, that is just what the patch is doing, :-) > >> > >> > >> But it's not enough. > > > > Yes, the patch is only for cases of mutli hw queue and having > > offline CPUs existed. > > > >> For instance, my test case, it's 255 tags and 64 CPUs. > >> We end up in cross-cpu spinlock nightmare mode. > > > > IMO, the scaling problem for the above case might be > > caused by either current percpu ida design or blk-mq's > > usage on it. > > That is pretty much my claim, yes. Basically I don't think per-cpu tag > caching is ever going to be the best solution for the combination of > modern machines and the hardware that is out there (limited tags). Sorry for not being more active in the discussion earlier, but anyways - I'm in 100% agreement with this. Percpu freelists are _fundamentally_ only _useful_ when you don't need to be using all your available tags, because percpu sharding requires wasting your tag space. I could write a mathematical proof of this if I cared enough. Otherwise what happens is on alloc failure you're touching all the other cachelines every single time and now you're bouncing _more_ cachelines than if you just had a single global freelist. So yeah, for small tag spaces just use a single simple bit vector on a single cacheline. BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed when steal_tags() runs was fundamentally wrong and broken in this respect, and should be reverted, whatever usage it was that was expecting to be able to allocate the entire tag space was the problem. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-01 22:47 ` Kent Overstreet @ 2014-05-02 2:19 ` Jens Axboe 2014-05-02 2:38 ` Kent Overstreet 2014-05-02 5:05 ` Christoph Hellwig 0 siblings, 2 replies; 26+ messages in thread From: Jens Axboe @ 2014-05-02 2:19 UTC (permalink / raw) To: Kent Overstreet Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-05-01 16:47, Kent Overstreet wrote: > On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: >> On 04/29/2014 05:35 AM, Ming Lei wrote: >>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>> On 2014-04-25 18:01, Ming Lei wrote: >>>>> >>>>> Hi Jens, >>>>> >>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>> >>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>>> >>>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>>> that's mostly because there's so much other crap going on in there. The >>>>>> most effective way to currently make it work better, is just to ensure >>>>>> the caching pool is of a sane size. >>>>> >>>>> >>>>> Yes, that is just what the patch is doing, :-) >>>> >>>> >>>> But it's not enough. >>> >>> Yes, the patch is only for cases of multi hw queues >>> with offline CPUs present. >>> >>>> For instance, my test case, it's 255 tags and 64 CPUs. >>>> We end up in cross-cpu spinlock nightmare mode. >>> >>> IMO, the scaling problem for the above case might be >>> caused by either the current percpu_ida design or blk-mq's >>> usage of it. >> >> That is pretty much my claim, yes. Basically I don't think per-cpu tag >> caching is ever going to be the best solution for the combination of >> modern machines and the hardware that is out there (limited tags). > > Sorry for not being more active in the discussion earlier, but anyways - I'm in > 100% agreement with this. > > Percpu freelists are _fundamentally_ only _useful_ when you don't need to be > using all your available tags, because percpu sharding requires wasting your tag > space. I could write a mathematical proof of this if I cared enough. > > Otherwise what happens is on alloc failure you're touching all the other > cachelines every single time and now you're bouncing _more_ cachelines than if > you just had a single global freelist. > > So yeah, for small tag spaces just use a single simple bit vector on a single > cacheline. I've taken the consequence of this and implemented another tagging scheme that blk-mq will use if it deems that percpu_ida isn't going to be effective for the device being initialized. But I really hate to have both of them in there. Unfortunately I have no devices available that have a tag space that will justify using percpu_ida, so comparisons are a bit hard at the moment. NVMe should change that, though, so the decision will have to be deferred until that is tested. > BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed > when steal_tags() runs was fundamentally wrong and broken in this respect, and > should be reverted, whatever usage it was that was expecting to be able to > allocate the entire tag space was the problem. It's hard to blame Shaohua, and I helped push that. It was an attempt at making percpu_ida actually useful for what blk-mq needs it for, and being the primary user of it, it was definitely worth doing. 
A tagging scheme that requires the tag space to be effectively sparse/huge to be fast is not a good generic tagging algorithm. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-02 2:19 ` Jens Axboe @ 2014-05-02 2:38 ` Kent Overstreet 2014-05-02 2:44 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Kent Overstreet @ 2014-05-02 2:38 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote: > On 2014-05-01 16:47, Kent Overstreet wrote: > >On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: > >>On 04/29/2014 05:35 AM, Ming Lei wrote: > >>>On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>>On 2014-04-25 18:01, Ming Lei wrote: > >>>>> > >>>>>Hi Jens, > >>>>> > >>>>>On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>>>> > >>>>>>On 04/25/2014 03:10 AM, Ming Lei wrote: > >>>>>> > >>>>>>Sorry, I did run it the other day. It has little to no effect here, but > >>>>>>that's mostly because there's so much other crap going on in there. The > >>>>>>most effective way to currently make it work better, is just to ensure > >>>>>>the caching pool is of a sane size. > >>>>> > >>>>> > >>>>>Yes, that is just what the patch is doing, :-) > >>>> > >>>> > >>>>But it's not enough. > >>> > >>>Yes, the patch is only for cases of multi hw queues > >>>with offline CPUs present. > >>> > >>>>For instance, my test case, it's 255 tags and 64 CPUs. > >>>>We end up in cross-cpu spinlock nightmare mode. > >>> > >>>IMO, the scaling problem for the above case might be > >>>caused by either the current percpu_ida design or blk-mq's > >>>usage of it. > >> > >>That is pretty much my claim, yes. Basically I don't think per-cpu tag > >>caching is ever going to be the best solution for the combination of > >>modern machines and the hardware that is out there (limited tags). > > > >Sorry for not being more active in the discussion earlier, but anyways - I'm in > >100% agreement with this. > > > >Percpu freelists are _fundamentally_ only _useful_ when you don't need to be > >using all your available tags, because percpu sharding requires wasting your tag > >space. I could write a mathematical proof of this if I cared enough. > > > >Otherwise what happens is on alloc failure you're touching all the other > >cachelines every single time and now you're bouncing _more_ cachelines than if > >you just had a single global freelist. > > > >So yeah, for small tag spaces just use a single simple bit vector on a single > >cacheline. > > I've taken the consequence of this and implemented another tagging scheme > that blk-mq will use if it deems that percpu_ida isn't going to be effective > for the device being initialized. But I really hate to have both of them in > there. Unfortunately I have no devices available that have a tag space that > will justify using percpu_ida, so comparisons are a bit hard at the moment. > NVMe should change that, though, so the decision will have to be deferred until > that is tested. Yeah, I agree that is annoying. I've thought about the issue too though and I haven't been able to come up with any better ideas, myself. A given driver probably should be able to always use one or the other though, so we shouldn't _need_ a runtime conditional because of this, though structuring the code that way might be more trouble than it's worth from my vague recollection of what blk-mq looks like... 
(I've actually been fighting with unrelated issues at a very similar layer of abstraction, it's quite annoying.). > >BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed > >when steal_tags() runs was fundamentally wrong and broken in this respect, and > >should be reverted, whatever usage it was that was expecting to be able to > >allocate the entire tag space was the problem. > > It's hard to blame Shaohua, and I helped push that. It was an attempt in > making percpu_ida actually useful for what blk-mq needs it for, and being > the primary user of it, it was definitely worth doing. A tagging scheme that > requires the tag space to be effectively sparse/huge to be fast is not a > good generic tagging algorithm. Yeah it was definitely not an unreasonable attempt and it's probably my own fault for not speaking up louder at the time (I can't remember how much I commented at the time). Ah well, irritating problem space :) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-02 2:38 ` Kent Overstreet @ 2014-05-02 2:44 ` Jens Axboe 0 siblings, 0 replies; 26+ messages in thread From: Jens Axboe @ 2014-05-02 2:44 UTC (permalink / raw) To: Kent Overstreet Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-05-01 20:38, Kent Overstreet wrote: > On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote: >> On 2014-05-01 16:47, Kent Overstreet wrote: >>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: >>>> On 04/29/2014 05:35 AM, Ming Lei wrote: >>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>> On 2014-04-25 18:01, Ming Lei wrote: >>>>>>> >>>>>>> Hi Jens, >>>>>>> >>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>> >>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>>>>> >>>>>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>>>>> that's mostly because there's so much other crap going on in there. The >>>>>>>> most effective way to currently make it work better, is just to ensure >>>>>>>> the caching pool is of a sane size. >>>>>>> >>>>>>> >>>>>>> Yes, that is just what the patch is doing, :-) >>>>>> >>>>>> >>>>>> But it's not enough. >>>>> >>>>> Yes, the patch is only for cases of mutli hw queue and having >>>>> offline CPUs existed. >>>>> >>>>>> For instance, my test case, it's 255 tags and 64 CPUs. >>>>>> We end up in cross-cpu spinlock nightmare mode. >>>>> >>>>> IMO, the scaling problem for the above case might be >>>>> caused by either current percpu ida design or blk-mq's >>>>> usage on it. >>>> >>>> That is pretty much my claim, yes. Basically I don't think per-cpu tag >>>> caching is ever going to be the best solution for the combination of >>>> modern machines and the hardware that is out there (limited tags). >>> >>> Sorry for not being more active in the discussion earlier, but anyways - I'm in >>> 100% agreement with this. >>> >>> Percpu freelists are _fundamentally_ only _useful_ when you don't need to be >>> using all your available tags, because percpu sharding requires wasting your tag >>> space. I could write a mathematical proof of this if I cared enough. >>> >>> Otherwise what happens is on alloc failure you're touching all the other >>> cachelines every single time and now you're bouncing _more_ cachelines than if >>> you just had a single global freelist. >>> >>> So yeah, for small tag spaces just use a single simple bit vector on a single >>> cacheline. >> >> I've taken the consequence of this and implemented another tagging scheme >> that blk-mq will use if it deems that percpu_ida isn't going to be effective >> for the device being initialized. But I really hate to have both of them in >> there. Unfortunately I have no devices available that have a tag space that >> will justify using percu_ida, so comparisons are a bit hard at the moment. >> NVMe should change that, though, so decision will have to be deferred until >> that is tested. > > Yeah, I agree that is annoying. I've thought about the issue too though and I > haven't been able to come up with any better ideas, myself. I have failed in that area, too, and it's not for lack of thinking about it and experimenting. So hence a new solution was thought up, based on a lot of userland prototyping and testing. Things considered, two solutions is better than no solution. 
> A given driver probably should be able to always use one or the other though, so
> we shouldn't _need_ a runtime conditional because of this, though structuring
> the code that way might be more trouble than it's worth from my vague
> recollection of what blk-mq looks like...

It's completely runtime conditional at this point, not sure how not to make
it so. This is transparent to the drivers; they should not care about what
kind of tagging scheme is used. If we present that, then we have failed. The
runtime conditional is still better than a function pointer, though, so it'll
likely stay that way for now. So the entry points now all look like this:

	if (use_new_scheme)
		ret = new_foo();
	else
		ret = foo();

At least it should branch predict really well :-)

> (I've actually been fighting with unrelated issues at a very similar layer of
> abstraction; it's quite annoying.)
>
>>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
>>> when steal_tags() runs was fundamentally wrong and broken in this respect, and
>>> should be reverted; whatever usage it was that was expecting to be able to
>>> allocate the entire tag space was the problem.
>>
>> It's hard to blame Shaohua, and I helped push that. It was an attempt at
>> making percpu_ida actually useful for what blk-mq needs it for, and being
>> the primary user of it, it was definitely worth doing. A tagging scheme that
>> requires the tag space to be effectively sparse/huge to be fast is not a
>> good generic tagging algorithm.
>
> Yeah, it was definitely not an unreasonable attempt, and it's probably my own
> fault for not speaking up louder at the time (I can't remember how much I
> commented at the time). Ah well, irritating problem space :)

Not a problem. I think the main failure here is that we have been coming at
this with clashing expectations of what the requirements are. And a further
issue, on my end, is wanting to cling to the percpu_ida tags, thinking it
could be made to work.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
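As a sketch of the entry-point shape Jens describes above -- one conditional
per entry point, invisible to drivers -- consider the following. The struct
fields and function names are assumptions for illustration; the only real API
used is percpu_ida_alloc(), with the signature shown in the patch later in
this thread:

	struct tags_sketch {
		bool use_new_scheme;		/* decided once at init time */
		struct small_tags small;	/* bitmap, small tag spaces */
		struct percpu_ida free_tags;	/* percpu_ida, large spaces */
	};

	static int get_tag_sketch(struct tags_sketch *tags, gfp_t gfp)
	{
		if (tags->use_new_scheme)
			return small_tag_get(&tags->small);

		return percpu_ida_alloc(&tags->free_tags,
					(gfp & __GFP_WAIT) ?
					TASK_UNINTERRUPTIBLE : TASK_RUNNING);
	}

Since use_new_scheme is written once when the queue is set up and never
changes afterwards, the branch is perfectly predictable, which is what makes
it cheaper in practice than an indirect call through a function pointer.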
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02  2:19 ` Jens Axboe
  2014-05-02  2:38 ` Kent Overstreet
@ 2014-05-02  5:05 ` Christoph Hellwig
  2014-05-02 16:41 ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2014-05-02 5:05 UTC (permalink / raw)
To: Jens Axboe
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going
> to be effective for the device being initialized. But I really hate
> to have both of them in there. Unfortunately I have no devices
> available that have a tag space that will justify using percpu_ida,
> so comparisons are a bit hard at the moment. NVMe should change
> that, though, so the decision will have to be deferred until that is
> tested.

At least for SCSI devices _tag space_ is plenty; it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately. In addition we also preallocate a request
for each tag, so even if we would track the queue depth separately
we would waste a lot of memory.

^ permalink raw reply	[flat|nested] 26+ messages in thread
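A rough back-of-the-envelope makes the preallocation cost concrete; the
request size used here is an assumption for illustration, not a measured
figure:

	/*
	 * Assume sizeof(struct request) ~ 400 bytes, one preallocated
	 * per tag:
	 *
	 *	tag space == queue depth (256) ->   256 * 400B ~ 100 KB / queue
	 *	full 16-bit tag space (65536)  -> 65536 * 400B ~  25 MB / queue
	 *
	 * Hence SCSI caps the allocator's tag space at the queue depth
	 * rather than tracking the depth separately against a much
	 * larger tag space.
	 */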
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02  5:05 ` Christoph Hellwig
@ 2014-05-02 16:41 ` Jens Axboe
  2014-05-02 16:43 ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2014-05-02 16:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On 05/01/2014 11:05 PM, Christoph Hellwig wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going
>> to be effective for the device being initialized. But I really hate
>> to have both of them in there. Unfortunately I have no devices
>> available that have a tag space that will justify using percpu_ida,
>> so comparisons are a bit hard at the moment. NVMe should change
>> that, though, so the decision will have to be deferred until that is
>> tested.
>
> At least for SCSI devices _tag space_ is plenty; it's just that we
> artificially limit our tag space to the queue depth to avoid having to
> track that one separately. In addition we also preallocate a request
> for each tag, so even if we would track the queue depth separately
> we would waste a lot of memory.

In practice it comes out to the same; it's not feasible to run a much
larger tag space and track the queue depth on top of it. So I don't think
that changes the conclusion for SCSI.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02 16:41 ` Jens Axboe
@ 2014-05-02 16:43 ` Christoph Hellwig
  2014-05-02 16:56 ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2014-05-02 16:43 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>> At least for SCSI devices _tag space_ is plenty; it's just that we
>> artificially limit our tag space to the queue depth to avoid having to
>> track that one separately. In addition we also preallocate a request
>> for each tag, so even if we would track the queue depth separately
>> we would waste a lot of memory.
>
> In practice it comes out to the same; it's not feasible to run a much
> larger tag space and track the queue depth on top of it. So I don't think
> that changes the conclusion for SCSI.

Agreed. Just had to check if my smart ass hat still fits this morning...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02 16:43 ` Christoph Hellwig
@ 2014-05-02 16:56 ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2014-05-02 16:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On 05/02/2014 10:43 AM, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>>> At least for SCSI devices _tag space_ is plenty; it's just that we
>>> artificially limit our tag space to the queue depth to avoid having to
>>> track that one separately. In addition we also preallocate a request
>>> for each tag, so even if we would track the queue depth separately
>>> we would waste a lot of memory.
>>
>> In practice it comes out to the same; it's not feasible to run a much
>> larger tag space and track the queue depth on top of it. So I don't think
>> that changes the conclusion for SCSI.
>
> Agreed. Just had to check if my smart ass hat still fits this morning...

If not, these are on sale:

    http://www.cafepress.com/+smartass_trucker_hat,18823771

:-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
		   ` (2 preceding siblings ...)
  2014-04-22  7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
@ 2014-04-22 11:56 ` Peter Zijlstra
  2014-05-01 21:24 ` Alexander Gordeev
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-04-22 11:56 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> Hello,
>
> This series is against 3.14.0-rc7.
>
> It is aimed to further improve 'percpu_ida' tags locality by taking
> into account the system's CPU topology when stealing tags. That is,
> try to steal from a CPU which is 'closest' to the stealing one.
>
> I would not bother to post this, since on several systems the change
> did not show any improvement, i.e. on one such:

There is much lower-level fruit to be had before doing something like
this.

  https://lkml.org/lkml/2014/1/23/257
  https://lkml.org/lkml/2014/1/23/329
  https://lkml.org/lkml/2014/1/23/343
  https://lkml.org/lkml/2014/1/23/354

I've not had time to revisit/finish them, but you should definitely
clean up the percpu_ida stuff and reduce existing contention before
going overboard.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-04-22 11:56 ` Peter Zijlstra
@ 2014-05-01 21:24 ` Alexander Gordeev
  2014-05-01 22:04 ` Alexander Gordeev
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Gordeev @ 2014-05-01 21:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> I've not had time to revisit/finish them, but you should definitely
> clean up the percpu_ida stuff and reduce existing contention before
> going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.

I hope it looks like you intended, and that my comments in the
follow-up message make sense.

Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 	*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
 		if (remote == tags)
 			continue;
 
-		spin_lock(&remote->lock);
+		double_lock(&tags->lock, &remote->lock);
 
 		if (remote->nr_free) {
 			memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
 		}
 
 		spin_unlock(&remote->lock);
+		spin_unlock(&tags->lock);
 
 		if (tags->nr_free)
 			break;
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
+	int tag = -ENOSPC;
+
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock(&tags->lock);
+
+	return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
 	int tag = -ENOSPC;
 
+	local_irq_save(flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
 	spin_lock(&tags->lock);
 	if (tags->nr_free)
 		tag = tags->freelist[--tags->nr_free];
-	spin_unlock(&tags->lock);
+	spin_unlock_irqrestore(&tags->lock, flags);
 
 	return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
+	int tag = -ENOSPC;
+
+	spin_lock_irqsave(&pool->lock, flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	spin_unlock(&pool->lock);
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
+		spin_lock(&tags->lock);
+		if (tags->nr_free) {
+			tag = tags->freelist[--tags->nr_free];
+			if (tags->nr_free) {
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
+			}
+		}
+		spin_unlock(&tags->lock);
+	}
+
+	local_irq_restore(flags);
+
+	return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
-	struct percpu_ida_cpu *tags;
-	unsigned long flags;
-	int tag;
-
-	local_irq_save(flags);
-	tags = this_cpu_ptr(pool->tag_cpu);
-
-	/* Fastpath */
-	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
-
-	while (1) {
-		spin_lock(&pool->lock);
-
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		if (state != TASK_RUNNING)
-			prepare_to_wait(&pool->wait, &wait, state);
-
-		if (!tags->nr_free)
-			alloc_global_tags(pool, tags);
-		if (!tags->nr_free)
-			steal_tags(pool, tags);
-
-		if (tags->nr_free) {
-			tag = tags->freelist[--tags->nr_free];
-			if (tags->nr_free)
-				cpumask_set_cpu(smp_processor_id(),
-						&pool->cpus_have_tags);
-		}
-
-		spin_unlock(&pool->lock);
-		local_irq_restore(flags);
-
-		if (tag >= 0 || state == TASK_RUNNING)
-			break;
-
-		if (signal_pending_state(state, current)) {
-			tag = -ERESTARTSYS;
-			break;
-		}
-
-		schedule();
-
-		local_irq_save(flags);
-		tags = this_cpu_ptr(pool->tag_cpu);
-	}
-	if (state != TASK_RUNNING)
-		finish_wait(&pool->wait, &wait);
-
-	return tag;
+	int ret;
+
+	ret = alloc_local_tag(pool);
+	if (likely(ret >= 0))
+		return ret;
+
+	if (state != TASK_RUNNING) {
+		int tag;
+
+		ret = ___wait_event(pool->wait,
+				(tag = alloc_global_tag(pool)) >= 0,
+				state, 0, 0, schedule());
+
+		if (tag >= 0)
+			ret = tag;
+	} else {
+		ret = alloc_global_tag(pool);
+	}
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_alloc);
 
@@ -239,12 +253,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
 		wake_up(&pool->wait);
 	}
 
+	/*
+	 * No point in waking when 1 < n < max_size free, because steal_tags()
+	 * will assume max_size per set bit, having more free will not make it
+	 * more or less likely it will visit this cpu.
+	 */
+
 	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
 
 		/*
-		 * Global lock held and irqs disabled, don't need percpu
-		 * lock
+		 * Global lock held and irqs disabled, don't need percpu lock
+		 * because everybody accessing remote @tags will hold
+		 * pool->lock -- steal_tags().
		 */
 		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,
@@ -344,33 +365,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
 int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
 	void *data)
 {
-	unsigned long flags;
 	struct percpu_ida_cpu *remote;
-	unsigned cpu, i, err = 0;
+	unsigned long flags;
+	int cpu, i, err = 0;
 
-	local_irq_save(flags);
 	for_each_possible_cpu(cpu) {
 		remote = per_cpu_ptr(pool->tag_cpu, cpu);
-		spin_lock(&remote->lock);
+		spin_lock_irqsave(&remote->lock, flags);
 		for (i = 0; i < remote->nr_free; i++) {
 			err = fn(remote->freelist[i], data);
 			if (err)
 				break;
 		}
-		spin_unlock(&remote->lock);
+		spin_unlock_irqrestore(&remote->lock, flags);
 		if (err)
-			goto out;
+			return err;
 	}
 
-	spin_lock(&pool->lock);
+	spin_lock_irqsave(&pool->lock, flags);
 	for (i = 0; i < pool->nr_free; i++) {
 		err = fn(pool->freelist[i], data);
 		if (err)
 			break;
 	}
-	spin_unlock(&pool->lock);
-out:
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&pool->lock, flags);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);
-- 
1.7.7.6

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply related	[flat|nested] 26+ messages in thread
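The double_lock() helper in the combined patch above is worth a note: it is
the classic address-ordering idiom for taking two locks of the same class
without deadlocking, the same trick the scheduler's double_rq_lock() uses.
A condensed view of the hazard it prevents:

	/*
	 * Without a global order, two CPUs stealing from each other's
	 * percpu freelists could interleave as
	 *
	 *	CPU A: spin_lock(&a->lock);  spin_lock(&b->lock);  <- spins
	 *	CPU B: spin_lock(&b->lock);  spin_lock(&a->lock);  <- spins
	 *
	 * (a classic AB-BA deadlock).  Sorting the two locks by address
	 * before acquiring guarantees every path takes them in the same
	 * total order, and spin_lock_nested(l2, SINGLE_DEPTH_NESTING)
	 * tells lockdep that taking a second lock of the same class is
	 * intentional.
	 */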
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-01 21:24 ` Alexander Gordeev
@ 2014-05-01 22:04 ` Alexander Gordeev
  0 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-05-01 22:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
> On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> > I've not had time to revisit/finish them, but you should definitely
> > clean up the percpu_ida stuff and reduce existing contention before
> > going overboard.
>
> Hi Peter,
>
> I tried to combine your series into a single patch against 3.15-rc3.
>
> I hope it looks like you intended, and that my comments in the
> follow-up message make sense.
>
> Thanks!
>
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 93d145e..14f5151 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
>  	*dst_nr += nr;
>  }
>
> +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> +{
> +	if (l1 > l2)
> +		swap(l1, l2);
> +
> +	spin_lock(l1);
> +	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  /*
>   * Try to steal tags from a remote cpu's percpu freelist.
>   *
> @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		if (remote == tags)
>  			continue;
>
> -		spin_lock(&remote->lock);
> +		double_lock(&tags->lock, &remote->lock);
>
>  		if (remote->nr_free) {
>  			memcpy(tags->freelist,
> @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		}
>
>  		spin_unlock(&remote->lock);
> +		spin_unlock(&tags->lock);
>
>  		if (tags->nr_free)
>  			break;
>  	}
>  }
>
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -				     struct percpu_ida_cpu *tags)
> +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>  {
> -	move_tags(tags->freelist, &tags->nr_free,
> -		  pool->freelist, &pool->nr_free,
> -		  min(pool->nr_free, pool->percpu_batch_size));
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock(&tags->lock);
> +
> +	return tag;
>  }
>
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
>  	int tag = -ENOSPC;
>
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
>  	spin_lock(&tags->lock);
>  	if (tags->nr_free)
>  		tag = tags->freelist[--tags->nr_free];
> -	spin_unlock(&tags->lock);
> +	spin_unlock_irqrestore(&tags->lock, flags);
>
>  	return tag;
>  }
>
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag = -ENOSPC;
> +
> +	spin_lock_irqsave(&pool->lock, flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

	local_irq_save(flags);
	tags = this_cpu_ptr(pool->tag_cpu);
	spin_lock(&pool->lock);

> +
> +	if (!tags->nr_free) {
> +		/*
> +		 * Move tags from the global-, onto our percpu-freelist.
> +		 */
> +		move_tags(tags->freelist, &tags->nr_free,
> +			  pool->freelist, &pool->nr_free,
> +			  min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed locklessly here and races with steal_tags().
Would double_lock() help here?

> +	}
> +
> +	spin_unlock(&pool->lock);

It is quite difficult to comment here, since the affected code is not
shown, but I will try nevertheless.

steal_tags() used to be serialized by 'pool->lock', which is now removed.
As a result, two (or more) concurrent threads could find and unset the
very same bit in this piece of code:

	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
	     cpus_have_tags; cpus_have_tags--) {
		cpu = cpumask_next(cpu, &pool->cpus_have_tags);	// *** FIND

		if (cpu >= nr_cpu_ids) {
			cpu = cpumask_first(&pool->cpus_have_tags);
			if (cpu >= nr_cpu_ids)
				BUG();
		}

		pool->cpu_last_stolen = cpu;
		remote = per_cpu_ptr(pool->tag_cpu, cpu);

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);	// *** UNSET

		...
	}

The second thread's cpumask_clear_cpu() races with cpumask_set_cpu()
in percpu_ida_free():

	if (nr_free == 1) {
		cpumask_set_cpu(smp_processor_id(),
				&pool->cpus_have_tags);
		// here the bit is UNSET by the second concurrent thread
		wake_up(&pool->wait);
	}

As a result, the woken thread will not find the illegitimately unset
bit while looping over the mask in steal_tags(). IOW, the tag might get
"lost".

> +	if (!tags->nr_free)
> +		steal_tags(pool, tags);
> +
> +	if (tags->nr_free) {
> +		spin_lock(&tags->lock);
> +		if (tags->nr_free) {
> +			tag = tags->freelist[--tags->nr_free];
> +			if (tags->nr_free) {
> +				cpumask_set_cpu(smp_processor_id(),
> +						&pool->cpus_have_tags);
> +			}
> +		}
> +		spin_unlock(&tags->lock);
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return tag;
> +}
> +
>  /**
>   * percpu_ida_alloc - allocate a tag
>   * @pool: pool to allocate from
> @@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>   */
>  int percpu_ida_alloc(struct percpu_ida *pool, int state)
>  {
> -	DEFINE_WAIT(wait);
> -	struct percpu_ida_cpu *tags;
> -	unsigned long flags;
> -	int tag;
> -
> -	local_irq_save(flags);
> -	tags = this_cpu_ptr(pool->tag_cpu);
> -
> -	/* Fastpath */
> -	tag = alloc_local_tag(tags);
> -	if (likely(tag >= 0)) {
> -		local_irq_restore(flags);
> -		return tag;
> -	}
> -
> -	while (1) {
> -		spin_lock(&pool->lock);
> -
> -		/*
> -		 * prepare_to_wait() must come before steal_tags(), in case
> -		 * percpu_ida_free() on another cpu flips a bit in
> -		 * cpus_have_tags
> -		 *
> -		 * global lock held and irqs disabled, don't need percpu lock
> -		 */
> -		if (state != TASK_RUNNING)
> -			prepare_to_wait(&pool->wait, &wait, state);
> -
> -		if (!tags->nr_free)
> -			alloc_global_tags(pool, tags);
> -		if (!tags->nr_free)
> -			steal_tags(pool, tags);
> -
> -		if (tags->nr_free) {
> -			tag = tags->freelist[--tags->nr_free];
> -			if (tags->nr_free)
> -				cpumask_set_cpu(smp_processor_id(),
> -						&pool->cpus_have_tags);
> -		}
> -
> -		spin_unlock(&pool->lock);
> -		local_irq_restore(flags);
> -
> -		if (tag >= 0 || state == TASK_RUNNING)
> -			break;
> -
> -		if (signal_pending_state(state, current)) {
> -			tag = -ERESTARTSYS;
> -			break;
> -		}
> -
> -		schedule();
> -
> -		local_irq_save(flags);
> -		tags = this_cpu_ptr(pool->tag_cpu);
> -	}
> -	if (state != TASK_RUNNING)
> -		finish_wait(&pool->wait, &wait);
> -
> -	return tag;
> +	int ret;
> +
> +	ret = alloc_local_tag(pool);
> +	if (likely(ret >= 0))
> +		return ret;
> +
> +	if (state != TASK_RUNNING) {
> +		int tag;
> +
> +		ret = ___wait_event(pool->wait,
> +				(tag = alloc_global_tag(pool)) >= 0,
> +				state, 0, 0, schedule());
> +
> +		if (tag >= 0)
> +			ret = tag;
> +	} else {
> +		ret = alloc_global_tag(pool);
> +	}
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_alloc);
>
> @@ -239,12 +253,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
>  		wake_up(&pool->wait);
>  	}
>
> +	/*
> +	 * No point in waking when 1 < n < max_size free, because steal_tags()
> +	 * will assume max_size per set bit, having more free will not make it
> +	 * more or less likely it will visit this cpu.
> +	 */
> +
>  	if (nr_free == pool->percpu_max_size) {
>  		spin_lock(&pool->lock);
>
>  		/*
> -		 * Global lock held and irqs disabled, don't need percpu
> -		 * lock
> +		 * Global lock held and irqs disabled, don't need percpu lock
> +		 * because everybody accessing remote @tags will hold
> +		 * pool->lock -- steal_tags().
>  		 */
>  		if (tags->nr_free == pool->percpu_max_size) {
>  			move_tags(pool->freelist, &pool->nr_free,
> @@ -344,33 +365,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
>  int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
>  	void *data)
>  {
> -	unsigned long flags;
>  	struct percpu_ida_cpu *remote;
> -	unsigned cpu, i, err = 0;
> +	unsigned long flags;
> +	int cpu, i, err = 0;
>
> -	local_irq_save(flags);
>  	for_each_possible_cpu(cpu) {
>  		remote = per_cpu_ptr(pool->tag_cpu, cpu);
> -		spin_lock(&remote->lock);
> +		spin_lock_irqsave(&remote->lock, flags);
>  		for (i = 0; i < remote->nr_free; i++) {
>  			err = fn(remote->freelist[i], data);
>  			if (err)
>  				break;
>  		}
> -		spin_unlock(&remote->lock);
> +		spin_unlock_irqrestore(&remote->lock, flags);
>  		if (err)
> -			goto out;
> +			return err;
>  	}
>
> -	spin_lock(&pool->lock);
> +	spin_lock_irqsave(&pool->lock, flags);
>  	for (i = 0; i < pool->nr_free; i++) {
>  		err = fn(pool->freelist[i], data);
>  		if (err)
>  			break;
>  	}
> -	spin_unlock(&pool->lock);
> -out:
> -	local_irq_restore(flags);
> +	spin_unlock_irqrestore(&pool->lock, flags);
> +
>  	return err;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
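The race Alexander describes can be condensed into a timeline. This is a
hypothetical trace under the assumption that steal_tags() no longer runs
under pool->lock, with two stealers S1/S2 and a CPU F freeing a tag:

	/*
	 *	S1: cpumask_next() finds CPU X set in pool->cpus_have_tags
	 *	S2: cpumask_next() finds the same CPU X (nothing excludes it)
	 *	S1: cpumask_clear_cpu(X); drains X's freelist under remote->lock
	 *	 F: percpu_ida_free() on X: nr_free becomes 1,
	 *	    cpumask_set_cpu(X), wake_up(&pool->wait)
	 *	S2: cpumask_clear_cpu(X)   <- wipes out the bit F just set
	 *
	 * A waiter woken by F now scans cpus_have_tags, does not see X, and
	 * goes back to sleep even though X holds a free tag; the tag stays
	 * "lost" until a later free on X sets the bit again.
	 */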