* [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
@ 2014-03-26 13:34 Alexander Gordeev
2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Gordeev, Kent Overstreet, Jens Axboe, Shaohua Li,
Nicholas Bellinger, Ingo Molnar, Peter Zijlstra
Hello,
This series is against 3.14.0-rc7.
It aims to further improve 'percpu_ida' tag locality by taking
the system's CPU topology into account when stealing tags; that is,
try to steal from the CPU 'closest' to the stealing one.
I would not have bothered posting this, since on several systems the
change did not show any improvement, e.g. on this one:
CPU0 attaching sched-domain:
domain 0: span 0,8 level SIBLING
groups: 0 (cpu_power = 588) 8 (cpu_power = 588)
domain 1: span 0-3,8-11 level MC
groups: 0,8 (cpu_power = 1176) 1,9 (cpu_power = 1176) 2,10 (cpu_power = 1176) 3,11 (cpu_power = 1176)
domain 2: span 0-15 level NUMA
groups: 0-3,8-11 (cpu_power = 4704) 4-7,12-15 (cpu_power = 4704)
But other systems (more dense?) showed a cache-hit rate increase of
up to 20%, e.g. this one:
CPU5 attaching sched-domain:
domain 0: span 0-5 level MC
groups: 5 (cpu_power = 1023) 0 (cpu_power = 1023) 1 (cpu_power = 1023) 2 (cpu_power = 1023) 3 (cpu_power = 1023) 4 (cpu_power = 1023)
domain 1: span 0-7 level NUMA
groups: 0-5 (cpu_power = 6138) 6-7 (cpu_power = 2046)
CPU6 attaching sched-domain:
domain 0: span 6-7 level MC
groups: 6 (cpu_power = 1023) 7 (cpu_power = 1023)
domain 1: span 0-7 level NUMA
groups: 6-7 (cpu_power = 2046) 0-5 (cpu_power = 6138)
I tested using the 'null_blk' device with the number of threads equal
to the number of CPUs, both with each thread affined to one CPU and
with no affinity at all, and saw no difference between the two setups.
Suggestions are welcome :)
Thanks!
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Alexander Gordeev (2):
sched: Introduce topology level masks and for_each_tlm() macro
percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()
 include/linux/percpu_ida.h |    1 -
 include/linux/sched.h      |    5 ++
 kernel/sched/core.c        |   89 ++++++++++++++++++++++++++++++++++++++++++++
 lib/percpu_ida.c           |   46 +++++++++-------------
 4 files changed, 113 insertions(+), 28 deletions(-)
--
1.7.7.6
^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
@ 2014-03-26 13:34 ` Alexander Gordeev
  2014-03-26 13:34 ` [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags() Alexander Gordeev
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel; +Cc: Alexander Gordeev, Ingo Molnar, Peter Zijlstra

We have the for_each_cpu_mask() macro (and friends) to enumerate
CPUs in a given cpumask. Such enumeration walks CPUs in ascending
order of their IDs and does not take into account the CPU topology
of the system. That is, each successive CPU could belong to any
level of the system topology with regard to the previously iterated
CPU.

Yet in some cases such indiscriminate enumeration yields suboptimal
results when for_each_cpu_mask() is used to find the first CPU that
matches certain criteria: if the search prefers a CPU as close as
possible to the current one (in the same core, package, last level
cache etc.) then the for_each_cpu_mask() macro alone is not enough.

To facilitate convenient topology-aware cpumask enumeration, this
update introduces the concept of topology level masks - a per-CPU
array of cpumasks. The n-th item in this array contains the CPUs
on the n-th topology level minus all CPUs on the previous levels
[0..n-1], relative to this CPU.

Additionally, a new macro, for_each_tlm(), is introduced to
enumerate the topology level masks described above. As a result,
a topology-aware enumeration of an arbitrary cpumask 'mask' would
look like this:

	struct cpumask **tlm;
	int cpu;

	for_each_tlm(tlm) {
		for_each_cpu_and(cpu, *tlm, mask) {
			/*
			 * Check if 'cpu' matches criteria
			 */
		}
	}

It is generally recommended to disable preemption when calling the
for_each_tlm() macro. Although that does not guarantee the search
will not migrate to another CPU once preemption is re-enabled, we
do not want it moved in the middle of the nested loops, which would
cause accesses to the original CPU's topology level masks.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |    5 +++
 kernel/sched/core.c   |   89 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..489df53 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2934,4 +2934,9 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+DECLARE_PER_CPU(struct cpumask **, sd_tlm);
+
+#define for_each_tlm(tlm) \
+	for ((tlm) = this_cpu_read(sd_tlm); *(tlm); (tlm)++)
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6edbef2..da6d119 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5451,6 +5451,18 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
 		destroy_sched_domain(sd, cpu);
 }
 
+static void destroy_topology_level_masks(struct cpumask **tlm)
+{
+	if (tlm) {
+		struct cpumask **masks;
+
+		for (masks = tlm; *masks; masks++)
+			kfree(*masks);
+
+		kfree(tlm);
+	}
+}
+
 /*
  * Keep a special pointer to the highest sched_domain that has
  * SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
@@ -5493,6 +5505,67 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+DEFINE_PER_CPU(struct cpumask **, sd_tlm);
+
+static struct cpumask **
+build_topology_level_masks(struct sched_domain *domain, int cpu)
+{
+	struct sched_domain *sd;
+	struct cpumask **ret;
+	struct cpumask *mask;
+	struct cpumask *prev;
+	int node = cpu_to_node(cpu);
+	int ndoms = 0;
+	int level = 0;
+
+	prev = kzalloc(cpumask_size(), GFP_KERNEL);
+	if (!prev)
+		return NULL;
+
+	for (sd = domain; sd; sd = sd->parent) {
+		if (sd->parent && cpumask_equal(sched_domain_span(sd->parent),
+						sched_domain_span(sd)))
+			continue;
+		ndoms++;
+	}
+
+	ret = kzalloc_node((ndoms + 1) * sizeof(ret[0]), GFP_KERNEL, node);
+	if (!ret)
+		goto err;
+
+	for (sd = domain; sd; sd = sd->parent) {
+		if (sd->parent && cpumask_equal(sched_domain_span(sd->parent),
+						sched_domain_span(sd)))
+			continue;
+
+		if (cpumask_equal(sched_domain_span(sd), prev))
+			break;
+
+		mask = kzalloc_node(cpumask_size(), GFP_KERNEL, node);
+		if (!mask)
+			goto err;
+
+		cpumask_andnot(mask, sched_domain_span(sd), prev);
+		cpumask_or(prev, prev, sched_domain_span(sd));
+
+		ret[level] = mask;
+		level++;
+	}
+
+	WARN_ON_ONCE(level != ndoms);
+
+err:
+	kfree(prev);
+	return ret;
+}
+
+static void cpu_attach_topology_level_masks(struct cpumask **masks, int cpu)
+{
+	struct cpumask **tmp = per_cpu(sd_tlm, cpu);
+	per_cpu(sd_tlm, cpu) = masks;
+	destroy_topology_level_masks(tmp);
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5569,6 +5642,7 @@ struct sd_data {
 
 struct s_data {
 	struct sched_domain ** __percpu sd;
+	struct cpumask *** __percpu masks;
 	struct root_domain *rd;
 };
 
@@ -5894,6 +5968,7 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 		if (!atomic_read(&d->rd->refcount))
 			free_rootdomain(&d->rd->rcu); /* fall through */
 	case sa_sd:
+		free_percpu(d->masks);
 		free_percpu(d->sd); /* fall through */
 	case sa_sd_storage:
 		__sdt_free(cpu_map); /* fall through */
@@ -5912,6 +5987,9 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
 	d->sd = alloc_percpu(struct sched_domain *);
 	if (!d->sd)
 		return sa_sd_storage;
+	d->masks = alloc_percpu(struct cpumask **);
+	if (!d->masks)
+		return sa_sd;
 	d->rd = alloc_rootdomain();
 	if (!d->rd)
 		return sa_sd;
@@ -6373,6 +6451,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 {
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
+	struct cpumask **masks;
 	struct s_data d;
 	int i, ret = -ENOMEM;
 
@@ -6421,11 +6500,21 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 		}
 	}
 
+	/* Build topology level plain masks for the domains */
+	for_each_cpu(i, cpu_map) {
+		sd = *per_cpu_ptr(d.sd, i);
+		masks = build_topology_level_masks(sd, i);
+		*per_cpu_ptr(d.masks, i) = masks;
+	}
+
	/* Attach the domains */
	rcu_read_lock();
	for_each_cpu(i, cpu_map) {
		sd = *per_cpu_ptr(d.sd, i);
		cpu_attach_domain(sd, d.rd, i);
+
+		masks = *per_cpu_ptr(d.masks, i);
+		cpu_attach_topology_level_masks(masks, i);
	}
	rcu_read_unlock();

-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread
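To make the level-mask construction above concrete, here is a minimal
userspace model of the same andnot/or computation. The 8-CPU topology
(SMT pairs inside two 4-CPU sockets) and all helper names are invented
for the sketch; this is not kernel code:

	#include <stdint.h>
	#include <stdio.h>

	/* Span of each sched-domain level for a given CPU, as a bitmask. */
	static uint32_t smt_span(int cpu)  { return 0x3u << (cpu & ~1); }  /* sibling pair */
	static uint32_t mc_span(int cpu)   { return 0xfu << (cpu & ~3); }  /* 4-CPU socket */
	static uint32_t numa_span(int cpu) { (void)cpu; return 0xff; }     /* whole system */

	int main(void)
	{
		uint32_t (*span[])(int) = { smt_span, mc_span, numa_span };
		uint32_t prev = 0;	/* union of all lower levels so far */
		int cpu = 5;

		for (unsigned int l = 0; l < 3; l++) {
			/* cpumask_andnot(): this level minus all previous levels */
			uint32_t tlm = span[l](cpu) & ~prev;

			/* cpumask_or(): accumulate for the next iteration */
			prev |= span[l](cpu);
			printf("cpu%d level%u tlm %#x\n", cpu, l, (unsigned int)tlm);
		}
		return 0;
	}

For cpu5 this prints the masks 0x30 (its own core), 0xc0 (the rest of
its socket) and 0x0f (the remote socket): exactly the nearest-first
ordering the next patch wants when stealing tags.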
* [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
  2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev
@ 2014-03-26 13:34 ` Alexander Gordeev
  2014-04-22  7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
  2014-04-22 11:56 ` Peter Zijlstra
  3 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-03-26 13:34 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Gordeev, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger

Function steal_tags() iterates through the 'cpus_have_tags' cpumask
ignoring the system's CPU topology. That leads to situations where a
newly stolen tag and the data structure(s) associated with it (i.e.
struct request in the block layer) are topology-wise more remote from
the stealing CPU than they could have been had steal_tags() taken the
system's topology into account.

This update makes use of the for_each_tlm() macro and mitigates the
problem described above. As a result, cache misses caused by accesses
from a CPU that stole a tag to that tag's associated data are lessened.

As a side effect, the percpu_ida::cpu_last_stolen variable becomes
superfluous and is eliminated.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
---
 include/linux/percpu_ida.h |    1 -
 lib/percpu_ida.c           |   46 ++++++++++++++++++-------------------
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
index f5cfdd6..0d891f3 100644
--- a/include/linux/percpu_ida.h
+++ b/include/linux/percpu_ida.h
@@ -40,7 +40,6 @@ struct percpu_ida {
 	 * we want to pick a cpu at random.  Cycling through them every
 	 * time we steal is a bit easier and more or less equivalent:
 	 */
-	unsigned			cpu_last_stolen;
 
 	/* For sleeping on allocation failure */
 	wait_queue_head_t		wait;
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..5c51baa 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -63,42 +63,34 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 static inline void steal_tags(struct percpu_ida *pool,
 			      struct percpu_ida_cpu *tags)
 {
-	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
+	struct cpumask **tlm;
+	int cpu;
 
-	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
-	     cpus_have_tags; cpus_have_tags--) {
-		cpu = cpumask_next(cpu, &pool->cpus_have_tags);
+	for_each_tlm(tlm) {
+		for_each_cpu_and(cpu, *tlm, &pool->cpus_have_tags) {
+			cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
 
-		if (cpu >= nr_cpu_ids) {
-			cpu = cpumask_first(&pool->cpus_have_tags);
-			if (cpu >= nr_cpu_ids)
-				BUG();
-		}
+			remote = per_cpu_ptr(pool->tag_cpu, cpu);
+			if (remote == tags)
+				continue;
 
-		pool->cpu_last_stolen = cpu;
-		remote = per_cpu_ptr(pool->tag_cpu, cpu);
+			spin_lock(&remote->lock);
 
-		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
+			if (remote->nr_free) {
+				memcpy(tags->freelist,
+				       remote->freelist,
+				       sizeof(unsigned) * remote->nr_free);
 
-		if (remote == tags)
-			continue;
+				tags->nr_free = remote->nr_free;
+				remote->nr_free = 0;
+			}
 
-		spin_lock(&remote->lock);
+			spin_unlock(&remote->lock);
 
-		if (remote->nr_free) {
-			memcpy(tags->freelist,
-			       remote->freelist,
-			       sizeof(unsigned) * remote->nr_free);
-
-			tags->nr_free = remote->nr_free;
-			remote->nr_free = 0;
+			if (tags->nr_free)
+				return;
 		}
-
-		spin_unlock(&remote->lock);
-
-		if (tags->nr_free)
-			break;
 	}
 }
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 26+ messages in thread
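A minimal sketch of the steal order the new nested loop produces,
reusing the same invented 8-CPU topology as the earlier model (the real
code also stops as soon as it has stolen tags; this model only prints
the visit order):

	#include <stdint.h>
	#include <stdio.h>

	/* Per-level mask for 'cpu', cf. the toy spans in the earlier model. */
	static uint32_t tlm(int cpu, unsigned int level)
	{
		uint32_t span[3] = { 0x3u << (cpu & ~1), 0xfu << (cpu & ~3), 0xff };
		uint32_t prev = level ? span[level - 1] : 0;	/* spans nest here */

		return span[level] & ~prev;
	}

	int main(void)
	{
		uint32_t cpus_have_tags = 0x43;	/* CPUs 0, 1 and 6 have cached tags */
		int self = 5, cpu;

		for (unsigned int level = 0; level < 3; level++) {
			uint32_t cand = tlm(self, level) & cpus_have_tags;

			for (cpu = 0; cpu < 8; cpu++)	/* stands in for for_each_cpu_and() */
				if (cand & (1u << cpu))
					printf("level %u: steal from cpu%d\n", level, cpu);
		}
		return 0;
	}

Run from cpu5, this visits cpu6 (same socket) before cpu0 and cpu1
(remote socket), whereas plain ascending-ID cycling would typically
reach the remote CPUs 0 and 1 first.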
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev 2014-03-26 13:34 ` [PATCH RFC 1/2] sched: Introduce topology level masks and for_each_tlm() macro Alexander Gordeev 2014-03-26 13:34 ` [PATCH RFC 2/2] percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags() Alexander Gordeev @ 2014-04-22 7:10 ` Alexander Gordeev 2014-04-22 14:03 ` Jens Axboe 2014-04-22 11:56 ` Peter Zijlstra 3 siblings, 1 reply; 26+ messages in thread From: Alexander Gordeev @ 2014-04-22 7:10 UTC (permalink / raw) To: linux-kernel Cc: Kent Overstreet, Jens Axboe, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: > But other systems (more dense?) showed increased cache-hit rate > up to 20%, i.e. this one: Hello Gentlemen, Any feedback on this? Thanks! -- Regards, Alexander Gordeev agordeev@redhat.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-22 7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev @ 2014-04-22 14:03 ` Jens Axboe 2014-04-22 15:57 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-22 14:03 UTC (permalink / raw) To: Alexander Gordeev, linux-kernel Cc: Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-22 01:10, Alexander Gordeev wrote: > On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >> But other systems (more dense?) showed increased cache-hit rate >> up to 20%, i.e. this one: > > Hello Gentlemen, > > Any feedback on this? Sorry for dropping the ball on this. Improvements wrt when to steal, how much, and from whom are sorely needed in percpu_ida. I'll do a bench with this on a system that currently falls apart with it. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-04-22 14:03 ` Jens Axboe
@ 2014-04-22 15:57 ` Jens Axboe
  2014-04-23  0:53 ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2014-04-22 15:57 UTC (permalink / raw)
To: Alexander Gordeev, linux-kernel
Cc: Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

On 04/22/2014 08:03 AM, Jens Axboe wrote:
> On 2014-04-22 01:10, Alexander Gordeev wrote:
>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>> But other systems (more dense?) showed increased cache-hit rate
>>> up to 20%, i.e. this one:
>>
>> Hello Gentlemen,
>>
>> Any feedback on this?
>
> Sorry for dropping the ball on this. Improvements wrt when to steal, how
> much, and from whom are sorely needed in percpu_ida. I'll do a bench
> with this on a system that currently falls apart with it.

Ran some quick numbers with three kernels:

stock		3.15-rc2
limit		3.15-rc2 + steal limit patch (attached)
limit+ag	3.15-rc2 + steal limit + your topology patch

Two tests were run - the device has an effective queue depth limit of
255, so I ran one test at QD=248 (low) and one at QD=512 (high) to both
exercise near-limit depth and over limit depth. 8 processes were used,
split into two groups. One group would always run on the local node,
the other would be run on the adjacent node (near) or on the far node
(far).

Near + low
----------
		IOPS		sys time
stock		1009.5K		55.78%
limit		1084.4K		54.47%
limit+ag	1058.1K		52.42%

Near + high
-----------
		IOPS		sys time
stock		 949.1K		75.12%
limit		 980.7K		64.74%
limit+ag	1010.1K		70.27%

Far + low
---------
		IOPS		sys time
stock		 600.0K		72.28%
limit		 761.7K		71.17%
limit+ag	 762.5K		74.48%

Far + high
----------
		IOPS		sys time
stock		 465.9K		91.66%
limit		 716.2K		88.68%
limit+ag	 758.0K		91.00%

One huge issue on this box is that it's a 4 socket/node machine, with
32 cores (64 threads). Combined with a 255 queue depth limit, the
percpu caching does not work well. I did not include stock+ag results,
they didn't change things very much for me. We simply have to limit the
stealing first, or we're still going to be hammering on percpu locks.

If we compare the top profiles from stock-far-high and
limit+ag-far-high, it looks pretty scary. Here's the stock one:

-  50,84%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 89,83% percpu_ida_alloc
   + 6,03% mtip_queue_rq
   + 2,90% percpu_ida_free

so 50% of the system time spent acquiring a spinlock, with 90% of that
being percpu ida. The limit+ag variant looks like this:

-  32,93%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 78,35% percpu_ida_alloc
   + 19,49% mtip_queue_rq
   + 1,21% __blk_mq_run_hw_queue

which is still pretty horrid and has plenty of room for improvement. I
think we need to make better decisions on the granularity of the tag
caching. If we ignore thread siblings, that'll double our effective
caching. If that's still not enough, I bet per-node/socket would be a
huge improvement.

-- 
Jens Axboe

[-- Attachment #2: limit-steal.patch --]
[-- Type: text/x-patch, Size: 690 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7a799c4..689bbaf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,6 +109,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 {
 	unsigned int nr_tags, nr_cache;
 	struct blk_mq_tags *tags;
+	unsigned int num_cpus;
 	int ret;
 
 	if (total_tags > BLK_MQ_TAG_MAX) {
@@ -121,7 +122,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 		return NULL;
 
 	nr_tags = total_tags - reserved_tags;
-	nr_cache = nr_tags / num_possible_cpus();
+	num_cpus = min(8U, num_online_cpus());
+	nr_cache = nr_tags / num_cpus;
 
 	if (nr_cache < BLK_MQ_TAG_CACHE_MIN)
 		nr_cache = BLK_MQ_TAG_CACHE_MIN;

^ permalink raw reply related	[flat|nested] 26+ messages in thread
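The arithmetic behind the attached limit, for the box above (255 tags,
64 threads; this assumes the BLK_MQ_TAG_CACHE_MIN floor does not kick
in, and is simple math on the numbers from the mail, not a measurement):

	#include <stdio.h>

	int main(void)
	{
		unsigned int nr_tags = 255, online_cpus = 64;
		/* stock divisor is num_possible_cpus(), here taken as 64 */
		unsigned int stock = nr_tags / online_cpus;
		/* patched divisor is min(8U, num_online_cpus()) */
		unsigned int num_cpus = online_cpus < 8 ? online_cpus : 8;
		unsigned int limit = nr_tags / num_cpus;

		/* prints "stock: 3 tags/cpu, limited: 31 tags/cpu" */
		printf("stock: %u tags/cpu, limited: %u tags/cpu\n", stock, limit);
		return 0;
	}

With only ~3 tags of percpu cache, nearly every allocation misses
locally and falls back to the shared lock; ~31 gives the cache a chance
to work.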
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-22 15:57 ` Jens Axboe @ 2014-04-23 0:53 ` Ming Lei 2014-04-23 1:25 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-23 0:53 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/22/2014 08:03 AM, Jens Axboe wrote: >> On 2014-04-22 01:10, Alexander Gordeev wrote: >>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >>>> But other systems (more dense?) showed increased cache-hit rate >>>> up to 20%, i.e. this one: >>> >>> Hello Gentlemen, >>> >>> Any feedback on this? >> >> Sorry for dropping the ball on this. Improvements wrt when to steal, how >> much, and from whom are sorely needed in percpu_ida. I'll do a bench >> with this on a system that currently falls apart with it. > > Ran some quick numbers with three kernels: > > stock 3.15-rc2 > limit 3.15-rc2 + steal limit patch (attached) I am thinking/working on this sort of improvement too, but my idea is to compute tags->nr_max_cache as below: nr_tags / hctx->max_nr_ctx hctx->max_nr_ctx means the max number of sw queues mapped to the hw queue; it needs to be introduced by this approach, and the value should effectively represent the CPU topology info. It is a bit complicated to compute hctx->max_nr_ctx because we need to take into account CPU hotplug and a possible user-defined mapping callback. If the user-defined mapping callback need not be considered, hctx->max_nr_ctx can be figured out before mapping sw queues in blk_mq_init_queue() by first supposing each CPU is online; once that is done, the map for offline CPUs is cleared and then blk_mq_map_swqueue() is called. In my null_blk test on a quad core SMP VM: - 4 hw queues - timer mode With the above approach, tag allocation from the local CPU improves from 5% to 50% for the boot CPU, and from 30% to 90% for non-boot CPUs. If no one objects to the idea, I'd like to post a patch for review. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
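A sketch of the sizing Ming describes; 'max_nr_ctx' is a field this
approach would add (not an existing blk-mq member), the tag count is
invented, and the helper below guesses it as ceil(nr_cpus /
nr_hw_queues), i.e. the most sw queues that can map to one hw queue:

	#include <stdio.h>

	static unsigned int nr_max_cache(unsigned int nr_tags, unsigned int nr_cpus,
					 unsigned int nr_hw_queues)
	{
		/* most sw queues (CPUs) that can map to a single hw queue */
		unsigned int max_nr_ctx = (nr_cpus + nr_hw_queues - 1) / nr_hw_queues;

		return nr_tags / max_nr_ctx;
	}

	int main(void)
	{
		/* quad core VM, 4 hw queues: each hw queue serves one sw queue,
		 * so its whole tag space can be cached locally */
		printf("%u\n", nr_max_cache(64, 4, 4));	/* 64 */
		/* same VM, single shared hw queue: cache shrinks to a quarter */
		printf("%u\n", nr_max_cache(64, 4, 1));	/* 16 */
		return 0;
	}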
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-23 0:53 ` Ming Lei @ 2014-04-23 1:25 ` Jens Axboe 2014-04-25 9:10 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-23 1:25 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-22 18:53, Ming Lei wrote: > Hi Jens, > > On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <axboe@kernel.dk> wrote: >> On 04/22/2014 08:03 AM, Jens Axboe wrote: >>> On 2014-04-22 01:10, Alexander Gordeev wrote: >>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >>>>> But other systems (more dense?) showed increased cache-hit rate >>>>> up to 20%, i.e. this one: >>>> >>>> Hello Gentlemen, >>>> >>>> Any feedback on this? >>> >>> Sorry for dropping the ball on this. Improvements wrt when to steal, how >>> much, and from whom are sorely needed in percpu_ida. I'll do a bench >>> with this on a system that currently falls apart with it. >> >> Ran some quick numbers with three kernels: >> >> stock 3.15-rc2 >> limit 3.15-rc2 + steal limit patch (attached) > > I am thinking/working on this sort of improving too, but my > idea is to compute tags->nr_max_cache by below: > > nr_tags / hctx->max_nr_ctx > > hctx->max_nr_ctx means the max sw queues mapped to the > hw queue, which need to be introduced in the approach, actually, > the value should represent the CPU topology info. > > It is a bit complicated to compute hctx->max_nr_ctx because > we need to take account into CPU hotplug and probable > user-defined mapping callback. We can always just update the caching info, that's not a big problem. We update the mappings on those events anyway. > If user-defined mapping callback needn't to be considered, the > hctx->max_nr_ctx can be figured out before mapping sw > queue in blk_mq_init_queue() by supposing each CPU is > online first, once it is done, the map for offline CPU is cleared, > then start to call blk_mq_map_swqueue(). I don't see how a user defined mapping would change things a whole lot. It's just another point of updating the cache. Besides, user defined mappings will be mostly (only?) for things like multiqueue, where the caching info would likely remain static over a reconfigure. > In my null_blk test on a quad core SMP VM: > > - 4 hw queue > - timer mode > > With the above approach, tag allocation from local CPU can be > improved from: > > 5% -> 50% for boot CPU > 30% -> 90% for non-boot CPU. > > If no one objects the idea, I'd like to post a patch for review. Sent it out, that can't hurt. I'll take a look at it, and give it a test spin as well. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-23 1:25 ` Jens Axboe @ 2014-04-25 9:10 ` Ming Lei 2014-04-25 21:23 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-25 9:10 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 2014-04-22 18:53, Ming Lei wrote: > > >> In my null_blk test on a quad core SMP VM: >> >> - 4 hw queue >> - timer mode >> >> With the above approach, tag allocation from local CPU can be >> improved from: >> >> 5% -> 50% for boot CPU >> 30% -> 90% for non-boot CPU. >> >> If no one objects the idea, I'd like to post a patch for review. > > Sent it out, that can't hurt. I'll take a look at it, and give it a test spin as well. I have sent out the patch at the link below, and would much appreciate it if you could take a look: marc.info/?l=linux-kernel&m=139826944613225&w=2 Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-25 9:10 ` Ming Lei @ 2014-04-25 21:23 ` Jens Axboe 2014-04-26 0:01 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-25 21:23 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 04/25/2014 03:10 AM, Ming Lei wrote: > Hi Jens, > > On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 2014-04-22 18:53, Ming Lei wrote: >> >> >>> In my null_blk test on a quad core SMP VM: >>> >>> - 4 hw queue >>> - timer mode >>> >>> With the above approach, tag allocation from local CPU can be >>> improved from: >>> >>> 5% -> 50% for boot CPU >>> 30% -> 90% for non-boot CPU. >>> >>> If no one objects the idea, I'd like to post a patch for review. >> >> >> Sent it out, that can't hurt. I'll take a look at it, and give it a test >> spin as well. > > I have sent out the patch in below link, and appreciate much > you will take a look at it. > > marc.info/?l=linux-kernel&m=139826944613225&w=2 Sorry, I did run it the other day. It has little to no effect here, but that's mostly because there's so much other crap going on in there. The most effective way to currently make it work better, is just to ensure the caching pool is of a sane size. I've got an alternative tagging scheme that I think would be useful for the cases where the tag space to cpu ratio isn't big enough. So I think we'll retain percpu_ida for the cases where it can cache enough, and punt to an alternative scheme when not. That doesn't mean we should not improve percpu_ida. There's quite a bit of low hanging fruit in there. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-25 21:23 ` Jens Axboe @ 2014-04-26 0:01 ` Ming Lei 2014-04-26 2:03 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-26 0:01 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra Hi Jens, On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/25/2014 03:10 AM, Ming Lei wrote: > > Sorry, I did run it the other day. It has little to no effect here, but > that's mostly because there's so much other crap going on in there. The > most effective way to currently make it work better, is just to ensure > the caching pool is of a sane size. Yes, that is just what the patch is doing, :-) From the percpu_ida view, it is easy to observe that it improves allocation performance. I have several patches to export this information via sysfs for monitoring percpu_ida performance. > > I've got an alternative tagging scheme that I think would be useful for > the cases where the tag space to cpu ratio isn't big enough. So I think > we'll retain percpu_ida for the cases where it can cache enough, and > punt to an alternative scheme when not. OK, care to comment on the patch or the idea of setting percpu cache size as (nr_tags / hctx->nr_ctx)? > > That doesn't mean we should not improve percpu_ida. There's quite a bit > of low hanging fruit in there. IMO percpu_max_size in percpu_ida is very important for performance, and it might need to be adjusted dynamically according to the percpu allocation load, but that is far more complicated to implement. Fixing the parameter before percpu_ida_init() might be the simplest way. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-26 0:01 ` Ming Lei @ 2014-04-26 2:03 ` Jens Axboe 2014-04-29 11:35 ` Ming Lei 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2014-04-26 2:03 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-04-25 18:01, Ming Lei wrote: > Hi Jens, > > On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 04/25/2014 03:10 AM, Ming Lei wrote: >> >> Sorry, I did run it the other day. It has little to no effect here, but >> that's mostly because there's so much other crap going on in there. The >> most effective way to currently make it work better, is just to ensure >> the caching pool is of a sane size. > > Yes, that is just what the patch is doing, :-) But it's not enough. For instance, my test case, it's 255 tags and 64 CPUs. We end up in cross-cpu spinlock nightmare mode. > From percpu_ida view, it is easy to observe it can improve > allocation performance. I have several patches to export > these information by sysfs for monitoring percpu_ida > performance. Sounds good! >> I've got an alternative tagging scheme that I think would be useful for >> the cases where the tag space to cpu ratio isn't big enough. So I think >> we'll retain percpu_ida for the cases where it can cache enough, and >> punt to an alternative scheme when not. > > OK, care to comment on the patch or the idea of setting percpu cache > size as (nr_tags / hctx->nr_ctx)? I think it's a good idea. The problem is that for percpu_ida to be effective, you need a bigger cache than the 3 I'd get above. If that isn't the case, it performs poorly. I'm just not convinced the design can ever work in the realm of realistic queue depths. >> That doesn't mean we should not improve percpu_ida. There's quite a bit >> of low hanging fruit in there. > > IMO percpu_max_size in percpu_ida is very important for the > performance, and it might need to adjust dynamically according > to the percpu allocation loading, but it is far more complicated > to implement. And it might be the simplest way to fix the parameter > before percpu_ida_init(). That's what I did, essentially. Ensuring that the percpu_max_size is at least 8 makes it a whole lot better here. But still slower than a regular simple bitmap, which makes me sad. A fairly straight forward cmpxchg based scheme I tested here is around 20% faster than the bitmap approach on a basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, I can't think of cases where that approach would not be faster than percpu_ida. That means all of SCSI, basically, and the basic block drivers. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
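The cmpxchg scheme itself is not posted in this thread; the following is
one plausible minimal shape of such an allocator, assuming at most 64
tags so the whole map fits in a single word on one cacheline. It is a
guess at the general idea, not the actual code Jens benchmarked, and
ffsll() is a glibc extension:

	#include <stdatomic.h>
	#include <stdint.h>
	#include <strings.h>	/* ffsll() */

	#define NR_TAGS 64

	static _Atomic uint64_t tag_map;	/* set bit = tag in use */

	static int tag_alloc(void)
	{
		uint64_t old = atomic_load_explicit(&tag_map, memory_order_relaxed);
		uint64_t new;
		int bit;

		do {
			bit = ffsll(~old);	/* first clear bit, 1-based */
			if (!bit || bit > NR_TAGS)
				return -1;	/* map full, no free tags */
			new = old | (1ULL << (bit - 1));
			/* on cmpxchg failure 'old' is reloaded and we retry */
		} while (!atomic_compare_exchange_weak(&tag_map, &old, new));

		return bit - 1;
	}

	static void tag_free(int tag)
	{
		atomic_fetch_and(&tag_map, ~(1ULL << tag));
	}

All contention stays on a single cacheline, which is the property Kent
argues for later in the thread for small tag spaces.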
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-26 2:03 ` Jens Axboe @ 2014-04-29 11:35 ` Ming Lei 2014-04-29 21:13 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Ming Lei @ 2014-04-29 11:35 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 2014-04-25 18:01, Ming Lei wrote: >> >> Hi Jens, >> >> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>> >>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>> >>> Sorry, I did run it the other day. It has little to no effect here, but >>> that's mostly because there's so much other crap going on in there. The >>> most effective way to currently make it work better, is just to ensure >>> the caching pool is of a sane size. >> Yes, that is just what the patch is doing, :-) > But it's not enough. Yes, the patch is only for cases of multi hw queues with offline CPUs present. > For instance, my test case, it's 255 tags and 64 CPUs. > We end up in cross-cpu spinlock nightmare mode. IMO, the scaling problem for the above case might be caused by either the current percpu_ida design or blk-mq's usage of it. One of the problems in blk-mq is that the 'set->queue_depth' parameter from the driver isn't scalable; maybe it is reasonable to introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' can be computed as below: max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) Another question is whether blk-mq can be improved by computing tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current approach assumes there is parallel I/O activity on each CPU, but I wonder if that is the common case in reality. Suppose there are N (N << online CPUs on a big machine) concurrent I/O streams on some of the CPUs; the percpu cache could then be increased a lot, to (nr_tags / N). > > >> From percpu_ida view, it is easy to observe it can improve >> allocation performance. I have several patches to export >> these information by sysfs for monitoring percpu_ida >> performance. > > > Sounds good! If we need to export percpu_ida allocation/free information via sysfs for monitoring performance, percpu_ida needs a parent kobject, which means it may be better to defer allocating the percpu_ida tags until the sw/hw queues are initialized, like the patch does. > > >>> I've got an alternative tagging scheme that I think would be useful for >>> the cases where the tag space to cpu ratio isn't big enough. So I think >>> we'll retain percpu_ida for the cases where it can cache enough, and >>> punt to an alternative scheme when not. >> >> >> OK, care to comment on the patch or the idea of setting percpu cache >> size as (nr_tags / hctx->nr_ctx)? >> >> >>> That doesn't mean we should not improve percpu_ida. There's quite a bit >>> of low hanging fruit in there. >> >> >> IMO percpu_max_size in percpu_ida is very important for the >> performance, and it might need to adjust dynamically according >> to the percpu allocation loading, but it is far more complicated >> to implement. And it might be the simplest way to fix the parameter >> before percpu_ida_init(). 
> > That's what I did, essentially. Ensuring that the percpu_max_size is at > least 8 makes it a whole lot better here. But still slower than a regular > simple bitmap, which makes me sad. A fairly straight forward cmpxchg based > scheme I tested here is around 20% faster than the bitmap approach on a > basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, > I can't think of cases where that approach would not be faster than > percpu_ida. That means all of SCSI, basically, and the basic block drivers. If percpu_ida wants to beat bitmap allocation, the local cache hit ratio has to stay high; in my tests that can be achieved with a large enough local cache. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 11:35 ` Ming Lei @ 2014-04-29 21:13 ` Jens Axboe 2014-04-30 9:40 ` Ming Lei 2014-05-01 22:47 ` Kent Overstreet 0 siblings, 2 replies; 26+ messages in thread From: Jens Axboe @ 2014-04-29 21:13 UTC (permalink / raw) To: Ming Lei Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 04/29/2014 05:35 AM, Ming Lei wrote: > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >> On 2014-04-25 18:01, Ming Lei wrote: >>> >>> Hi Jens, >>> >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>> >>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>> >>>> Sorry, I did run it the other day. It has little to no effect here, but >>>> that's mostly because there's so much other crap going on in there. The >>>> most effective way to currently make it work better, is just to ensure >>>> the caching pool is of a sane size. >>> >>> >>> Yes, that is just what the patch is doing, :-) >> >> >> But it's not enough. > > Yes, the patch is only for cases of multi hw queues > with offline CPUs present. > >> For instance, my test case, it's 255 tags and 64 CPUs. >> We end up in cross-cpu spinlock nightmare mode. > > IMO, the scaling problem for the above case might be > caused by either the current percpu_ida design or blk-mq's > usage of it. That is pretty much my claim, yes. Basically I don't think per-cpu tag caching is ever going to be the best solution for the combination of modern machines and the hardware that is out there (limited tags). > One of the problems in blk-mq is that the 'set->queue_depth' > parameter from the driver isn't scalable; maybe it is reasonable to > introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' > can be computed as below: > > max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) > > Another question is whether blk-mq can be improved by computing > tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current > approach assumes there is parallel I/O > activity on each CPU, but I wonder if that is the common > case in reality. Suppose there are N (N << online CPUs on a > big machine) concurrent I/O streams on some of the CPUs; the percpu cache > could then be increased a lot, to (nr_tags / N). That would certainly help the common case, but it'd still be slow for the cases where you DO have IO from lots of sources. If we consider 8-16 tags the minimum for balanced performance, then that doesn't take a whole lot of CPUs to spread out the tag space. Just looking at a case today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the "bigger" case of the Micron card, you still only have 255 active tags. And we probably want to split that up into groups of 32, making the problem even worse. >> That's what I did, essentially. Ensuring that the percpu_max_size is at >> least 8 makes it a whole lot better here. But still slower than a regular >> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based >> scheme I tested here is around 20% faster than the bitmap approach on a >> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, >> I can't think of cases where that approach would not be faster than >> percpu_ida. That means all of SCSI, basically, and the basic block drivers. > > If percpu_ida wants to beat bitmap allocation, the local cache hit > ratio has to stay high; in my tests that can be achieved with a large > enough local cache. 
Yes, that is exactly the issue, local cache hit must be high, and you pretty much need a higher local cache count for that. And therein lies the problem, you can't get that high local cache size for most common cases. With enough tags we could, but that's not what most people will run. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 21:13 ` Jens Axboe @ 2014-04-30 9:40 ` Ming Lei 0 siblings, 0 replies; 26+ messages in thread From: Ming Lei @ 2014-04-30 9:40 UTC (permalink / raw) To: Jens Axboe Cc: Alexander Gordeev, Linux Kernel Mailing List, Kent Overstreet, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe <axboe@kernel.dk> wrote: > On 04/29/2014 05:35 AM, Ming Lei wrote: >> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>> On 2014-04-25 18:01, Ming Lei wrote: >>>> >>>> Hi Jens, >>>> >>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>> >>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>> >>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>> that's mostly because there's so much other crap going on in there. The >>>>> most effective way to currently make it work better, is just to ensure >>>>> the caching pool is of a sane size. >>>> >>>> >>>> Yes, that is just what the patch is doing, :-) >>> >>> >>> But it's not enough. >> >> Yes, the patch is only for cases of multi hw queues >> with offline CPUs present. >> >>> For instance, my test case, it's 255 tags and 64 CPUs. >>> We end up in cross-cpu spinlock nightmare mode. >> >> IMO, the scaling problem for the above case might be >> caused by either the current percpu_ida design or blk-mq's >> usage of it. > > That is pretty much my claim, yes. Basically I don't think per-cpu tag > caching is ever going to be the best solution for the combination of > modern machines and the hardware that is out there (limited tags). > >> One of the problems in blk-mq is that the 'set->queue_depth' >> parameter from the driver isn't scalable; maybe it is reasonable to >> introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' >> can be computed as below: >> >> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) >> >> Another question is whether blk-mq can be improved by computing >> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current >> approach assumes there is parallel I/O >> activity on each CPU, but I wonder if that is the common >> case in reality. Suppose there are N (N << online CPUs on a >> big machine) concurrent I/O streams on some of the CPUs; the percpu cache >> could then be increased a lot, to (nr_tags / N). > > That would certainly help the common case, but it'd still be slow for > the cases where you DO have IO from lots of sources. It may be difficult to figure out an efficient solution for the unusual case. > If we consider 8-16 tags the minimum for balanced performance, then that > doesn't take a whole lot of CPUs to spread out the tag space. Just looking at a case > today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the > "bigger" case of the Micron card, you still only have 255 active tags. > And we probably want to split that up into groups of 32, making the > problem even worse. Yes, there is a contradiction between having limited hardware queue tags and requiring more local CPU cache. But it is really a challenge to figure out an efficient approach in the case where lots of CPUs contend for very limited resources. Maybe blk-mq can learn from network devices: move the hw queue_depth constraint into a lower level (such as blk_mq_run_hw_queue()) and keep adequate tags in the percpu pool, which means the pool's nr_tags can be much bigger than queue_depth, keeping enough percpu cache. 
When the hw queue is full, congestion control can be applied in blk_mq_alloc_request() to avoid the cross-cpu spinlock nightmare in percpu allocation/free. But if the device requires tags to be less than queue_depth, more work is needed for this approach. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 26+ messages in thread
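A toy illustration of the decoupling idea above: tags come from a pool
larger than the hardware queue depth, and the depth itself is enforced
by a separate in-flight counter at dispatch time. This is a pure sketch
of the suggestion, not blk-mq code, and the constant is made up:

	#include <stdatomic.h>
	#include <stdbool.h>

	#define HW_QUEUE_DEPTH 255	/* the device's real tag limit */

	static atomic_int inflight;

	/* Called before handing a request to hardware. */
	static bool hw_can_dispatch(void)
	{
		if (atomic_fetch_add(&inflight, 1) < HW_QUEUE_DEPTH)
			return true;
		atomic_fetch_sub(&inflight, 1);	/* over depth: congested, back off */
		return false;
	}

	/* Called on completion. */
	static void hw_complete(void)
	{
		atomic_fetch_sub(&inflight, 1);
	}

As noted, if the device requires the tag value itself to be below
queue_depth, this simple split is not enough.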
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-04-29 21:13 ` Jens Axboe 2014-04-30 9:40 ` Ming Lei @ 2014-05-01 22:47 ` Kent Overstreet 2014-05-02 2:19 ` Jens Axboe 1 sibling, 1 reply; 26+ messages in thread From: Kent Overstreet @ 2014-05-01 22:47 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: > On 04/29/2014 05:35 AM, Ming Lei wrote: > > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > >> On 2014-04-25 18:01, Ming Lei wrote: > >>> > >>> Hi Jens, > >>> > >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>> > >>>> On 04/25/2014 03:10 AM, Ming Lei wrote: > >>>> > >>>> Sorry, I did run it the other day. It has little to no effect here, but > >>>> that's mostly because there's so much other crap going on in there. The > >>>> most effective way to currently make it work better, is just to ensure > >>>> the caching pool is of a sane size. > >>> > >>> > >>> Yes, that is just what the patch is doing, :-) > >> > >> > >> But it's not enough. > > > > Yes, the patch is only for cases of mutli hw queue and having > > offline CPUs existed. > > > >> For instance, my test case, it's 255 tags and 64 CPUs. > >> We end up in cross-cpu spinlock nightmare mode. > > > > IMO, the scaling problem for the above case might be > > caused by either current percpu ida design or blk-mq's > > usage on it. > > That is pretty much my claim, yes. Basically I don't think per-cpu tag > caching is ever going to be the best solution for the combination of > modern machines and the hardware that is out there (limited tags). Sorry for not being more active in the discussion earlier, but anyways - I'm in 100% agreement with this. Percpu freelists are _fundamentally_ only _useful_ when you don't need to be using all your available tags, because percpu sharding requires wasting your tag space. I could write a mathematical proof of this if I cared enough. Otherwise what happens is on alloc failure you're touching all the other cachelines every single time and now you're bouncing _more_ cachelines than if you just had a single global freelist. So yeah, for small tag spaces just use a single simple bit vector on a single cacheline. BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed when steal_tags() runs was fundamentally wrong and broken in this respect, and should be reverted, whatever usage it was that was expecting to be able to allocate the entire tag space was the problem. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-01 22:47 ` Kent Overstreet @ 2014-05-02 2:19 ` Jens Axboe 2014-05-02 2:38 ` Kent Overstreet 2014-05-02 5:05 ` Christoph Hellwig 0 siblings, 2 replies; 26+ messages in thread From: Jens Axboe @ 2014-05-02 2:19 UTC (permalink / raw) To: Kent Overstreet Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-05-01 16:47, Kent Overstreet wrote: > On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: >> On 04/29/2014 05:35 AM, Ming Lei wrote: >>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>> On 2014-04-25 18:01, Ming Lei wrote: >>>>> >>>>> Hi Jens, >>>>> >>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>> >>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>>> >>>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>>> that's mostly because there's so much other crap going on in there. The >>>>>> most effective way to currently make it work better, is just to ensure >>>>>> the caching pool is of a sane size. >>>>> >>>>> >>>>> Yes, that is just what the patch is doing, :-) >>>> >>>> >>>> But it's not enough. >>> >>> Yes, the patch is only for cases of multi hw queues >>> with offline CPUs present. >>> >>>> For instance, my test case, it's 255 tags and 64 CPUs. >>>> We end up in cross-cpu spinlock nightmare mode. >>> >>> IMO, the scaling problem for the above case might be >>> caused by either the current percpu_ida design or blk-mq's >>> usage of it. >> >> That is pretty much my claim, yes. Basically I don't think per-cpu tag >> caching is ever going to be the best solution for the combination of >> modern machines and the hardware that is out there (limited tags). > > Sorry for not being more active in the discussion earlier, but anyways - I'm in > 100% agreement with this. > > Percpu freelists are _fundamentally_ only _useful_ when you don't need to be > using all your available tags, because percpu sharding requires wasting your tag > space. I could write a mathematical proof of this if I cared enough. > > Otherwise what happens is on alloc failure you're touching all the other > cachelines every single time and now you're bouncing _more_ cachelines than if > you just had a single global freelist. > > So yeah, for small tag spaces just use a single simple bit vector on a single > cacheline. I've taken the consequence of this and implemented another tagging scheme that blk-mq will use if it deems that percpu_ida isn't going to be effective for the device being initialized. But I really hate to have both of them in there. Unfortunately I have no devices available that have a tag space that will justify using percpu_ida, so comparisons are a bit hard at the moment. NVMe should change that, though, so the decision will have to be deferred until that is tested. > BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed > when steal_tags() runs was fundamentally wrong and broken in this respect, and > should be reverted, whatever usage it was that was expecting to be able to > allocate the entire tag space was the problem. It's hard to blame Shaohua, and I helped push that. It was an attempt at making percpu_ida actually useful for what blk-mq needs it for, and being the primary user of it, it was definitely worth doing. 
A tagging scheme that requires the tag space to be effectively sparse/huge to be fast is not a good generic tagging algorithm. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-02 2:19 ` Jens Axboe @ 2014-05-02 2:38 ` Kent Overstreet 2014-05-02 2:44 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Kent Overstreet @ 2014-05-02 2:38 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote: > On 2014-05-01 16:47, Kent Overstreet wrote: > >On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: > >>On 04/29/2014 05:35 AM, Ming Lei wrote: > >>>On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>>On 2014-04-25 18:01, Ming Lei wrote: > >>>>> > >>>>>Hi Jens, > >>>>> > >>>>>On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: > >>>>>> > >>>>>>On 04/25/2014 03:10 AM, Ming Lei wrote: > >>>>>> > >>>>>>Sorry, I did run it the other day. It has little to no effect here, but > >>>>>>that's mostly because there's so much other crap going on in there. The > >>>>>>most effective way to currently make it work better, is just to ensure > >>>>>>the caching pool is of a sane size. > >>>>> > >>>>> > >>>>>Yes, that is just what the patch is doing, :-) > >>>> > >>>> > >>>>But it's not enough. > >>> > >>>Yes, the patch is only for cases of multi hw queues > >>>with offline CPUs present. > >>> > >>>>For instance, my test case, it's 255 tags and 64 CPUs. > >>>>We end up in cross-cpu spinlock nightmare mode. > >>> > >>>IMO, the scaling problem for the above case might be > >>>caused by either the current percpu_ida design or blk-mq's > >>>usage of it. > >> > >>That is pretty much my claim, yes. Basically I don't think per-cpu tag > >>caching is ever going to be the best solution for the combination of > >>modern machines and the hardware that is out there (limited tags). > > > >Sorry for not being more active in the discussion earlier, but anyways - I'm in > >100% agreement with this. > > > >Percpu freelists are _fundamentally_ only _useful_ when you don't need to be > >using all your available tags, because percpu sharding requires wasting your tag > >space. I could write a mathematical proof of this if I cared enough. > > > >Otherwise what happens is on alloc failure you're touching all the other > >cachelines every single time and now you're bouncing _more_ cachelines than if > >you just had a single global freelist. > > > >So yeah, for small tag spaces just use a single simple bit vector on a single > >cacheline. > > I've taken the consequence of this and implemented another tagging scheme > that blk-mq will use if it deems that percpu_ida isn't going to be effective > for the device being initialized. But I really hate to have both of them in > there. Unfortunately I have no devices available that have a tag space that > will justify using percpu_ida, so comparisons are a bit hard at the moment. > NVMe should change that, though, so the decision will have to be deferred until > that is tested. Yeah, I agree that is annoying. I've thought about the issue too though and I haven't been able to come up with any better ideas, myself. A given driver probably should be able to always use one or the other though, so we shouldn't _need_ a runtime conditional because of this, though structuring the code that way might be more trouble than it's worth from my vague recollection of what blk-mq looks like... 
(I've actually been fighting with unrelated issues at a very similar layer of abstraction, it's quite annoying.). > >BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed > >when steal_tags() runs was fundamentally wrong and broken in this respect, and > >should be reverted, whatever usage it was that was expecting to be able to > >allocate the entire tag space was the problem. > > It's hard to blame Shaohua, and I helped push that. It was an attempt in > making percpu_ida actually useful for what blk-mq needs it for, and being > the primary user of it, it was definitely worth doing. A tagging scheme that > requires the tag space to be effectively sparse/huge to be fast is not a > good generic tagging algorithm. Yeah it was definitely not an unreasonable attempt and it's probably my own fault for not speaking up louder at the time (I can't remember how much I commented at the time). Ah well, irritating problem space :) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags 2014-05-02 2:38 ` Kent Overstreet @ 2014-05-02 2:44 ` Jens Axboe 0 siblings, 0 replies; 26+ messages in thread From: Jens Axboe @ 2014-05-02 2:44 UTC (permalink / raw) To: Kent Overstreet Cc: Ming Lei, Alexander Gordeev, Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger, Ingo Molnar, Peter Zijlstra On 2014-05-01 20:38, Kent Overstreet wrote: > On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote: >> On 2014-05-01 16:47, Kent Overstreet wrote: >>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote: >>>> On 04/29/2014 05:35 AM, Ming Lei wrote: >>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>> On 2014-04-25 18:01, Ming Lei wrote: >>>>>>> >>>>>>> Hi Jens, >>>>>>> >>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@kernel.dk> wrote: >>>>>>>> >>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>>>>>>> >>>>>>>> Sorry, I did run it the other day. It has little to no effect here, but >>>>>>>> that's mostly because there's so much other crap going on in there. The >>>>>>>> most effective way to currently make it work better, is just to ensure >>>>>>>> the caching pool is of a sane size. >>>>>>> >>>>>>> >>>>>>> Yes, that is just what the patch is doing, :-) >>>>>> >>>>>> >>>>>> But it's not enough. >>>>> >>>>> Yes, the patch is only for cases of mutli hw queue and having >>>>> offline CPUs existed. >>>>> >>>>>> For instance, my test case, it's 255 tags and 64 CPUs. >>>>>> We end up in cross-cpu spinlock nightmare mode. >>>>> >>>>> IMO, the scaling problem for the above case might be >>>>> caused by either current percpu ida design or blk-mq's >>>>> usage on it. >>>> >>>> That is pretty much my claim, yes. Basically I don't think per-cpu tag >>>> caching is ever going to be the best solution for the combination of >>>> modern machines and the hardware that is out there (limited tags). >>> >>> Sorry for not being more active in the discussion earlier, but anyways - I'm in >>> 100% agreement with this. >>> >>> Percpu freelists are _fundamentally_ only _useful_ when you don't need to be >>> using all your available tags, because percpu sharding requires wasting your tag >>> space. I could write a mathematical proof of this if I cared enough. >>> >>> Otherwise what happens is on alloc failure you're touching all the other >>> cachelines every single time and now you're bouncing _more_ cachelines than if >>> you just had a single global freelist. >>> >>> So yeah, for small tag spaces just use a single simple bit vector on a single >>> cacheline. >> >> I've taken the consequence of this and implemented another tagging scheme >> that blk-mq will use if it deems that percpu_ida isn't going to be effective >> for the device being initialized. But I really hate to have both of them in >> there. Unfortunately I have no devices available that have a tag space that >> will justify using percu_ida, so comparisons are a bit hard at the moment. >> NVMe should change that, though, so decision will have to be deferred until >> that is tested. > > Yeah, I agree that is annoying. I've thought about the issue too though and I > haven't been able to come up with any better ideas, myself. I have failed in that area, too, and it's not for lack of thinking about it and experimenting. So hence a new solution was thought up, based on a lot of userland prototyping and testing. Things considered, two solutions is better than no solution. 
> A given driver probably should be able to always use one or the other though, so
> we shouldn't _need_ a runtime conditional because of this, though structuring
> the code that way might be more trouble than it's worth from my vague
> recollection of what blk-mq looks like...

It's completely runtime conditional at this point, not sure how not to make
it so. This is transparent to the drivers; they should not care about what
kind of tagging scheme is used. If we present that, then we have failed. The
runtime conditional is still better than a function pointer, though, so it'll
likely stay that way for now. So the entry points now all look like this:

	if (use_new_scheme)
		ret = new_foo();
	else
		ret = foo();

At least it should branch predict really well :-)

> (I've actually been fighting with unrelated issues at a very similar layer of
> abstraction; it's quite annoying.)
>
>>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
>>> when steal_tags() runs was fundamentally wrong and broken in this respect, and
>>> should be reverted; whatever usage it was that was expecting to be able to
>>> allocate the entire tag space was the problem.
>>
>> It's hard to blame Shaohua, and I helped push that. It was an attempt at
>> making percpu_ida actually useful for what blk-mq needs it for, and being
>> the primary user of it, it was definitely worth doing. A tagging scheme that
>> requires the tag space to be effectively sparse/huge to be fast is not a
>> good generic tagging algorithm.
>
> Yeah, it was definitely not an unreasonable attempt, and it's probably my own
> fault for not speaking up louder at the time (I can't remember how much I
> commented at the time). Ah well, irritating problem space :)

Not a problem. I think the main failure here is that we have been coming at
this with clashing expectations of what the requirements are. And a further
issue, on my end, is wanting to cling to the percpu_ida tags, thinking it
could be made to work.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
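As a sketch of the entry-point shape Jens describes above -- one conditional
per entry point, invisible to drivers -- consider the following. The struct
fields and function names are assumptions for illustration; the only real API
used is percpu_ida_alloc(), with the signature shown in the patch later in
this thread:

	struct tags_sketch {
		bool use_new_scheme;		/* decided once at init time */
		struct small_tags small;	/* bitmap, small tag spaces */
		struct percpu_ida free_tags;	/* percpu_ida, large spaces */
	};

	static int get_tag_sketch(struct tags_sketch *tags, gfp_t gfp)
	{
		if (tags->use_new_scheme)
			return small_tag_get(&tags->small);

		return percpu_ida_alloc(&tags->free_tags,
					(gfp & __GFP_WAIT) ?
					TASK_UNINTERRUPTIBLE : TASK_RUNNING);
	}

Since use_new_scheme is written once when the queue is set up and never
changes afterwards, the branch is perfectly predictable, which is what makes
it cheaper in practice than an indirect call through a function pointer.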
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02  2:19 ` Jens Axboe
  2014-05-02  2:38 ` Kent Overstreet
@ 2014-05-02  5:05 ` Christoph Hellwig
  2014-05-02 16:41 ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2014-05-02 5:05 UTC (permalink / raw)
To: Jens Axboe
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going
> to be effective for the device being initialized. But I really hate
> to have both of them in there. Unfortunately I have no devices
> available that have a tag space that will justify using percpu_ida,
> so comparisons are a bit hard at the moment. NVMe should change
> that, though, so the decision will have to be deferred until that is
> tested.

At least for SCSI devices _tag space_ is plenty; it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately. In addition we also preallocate a request
for each tag, so even if we would track the queue depth separately
we would waste a lot of memory.

^ permalink raw reply	[flat|nested] 26+ messages in thread
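A rough back-of-the-envelope makes the preallocation cost concrete; the
request size used here is an assumption for illustration, not a measured
figure:

	/*
	 * Assume sizeof(struct request) ~ 400 bytes, one preallocated
	 * per tag:
	 *
	 *	tag space == queue depth (256) ->   256 * 400B ~ 100 KB / queue
	 *	full 16-bit tag space (65536)  -> 65536 * 400B ~  25 MB / queue
	 *
	 * Hence SCSI caps the allocator's tag space at the queue depth
	 * rather than tracking the depth separately against a much
	 * larger tag space.
	 */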
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02  5:05 ` Christoph Hellwig
@ 2014-05-02 16:41 ` Jens Axboe
  2014-05-02 16:43 ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2014-05-02 16:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On 05/01/2014 11:05 PM, Christoph Hellwig wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going
>> to be effective for the device being initialized. But I really hate
>> to have both of them in there. Unfortunately I have no devices
>> available that have a tag space that will justify using percpu_ida,
>> so comparisons are a bit hard at the moment. NVMe should change
>> that, though, so the decision will have to be deferred until that is
>> tested.
>
> At least for SCSI devices _tag space_ is plenty; it's just that we
> artificially limit our tag space to the queue depth to avoid having to
> track that one separately. In addition we also preallocate a request
> for each tag, so even if we would track the queue depth separately
> we would waste a lot of memory.

In practice it comes out to the same; it's not feasible to run a much
larger tag space and track the queue depth on top of it. So I don't think
that changes the conclusion for SCSI.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02 16:41 ` Jens Axboe
@ 2014-05-02 16:43 ` Christoph Hellwig
  2014-05-02 16:56 ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2014-05-02 16:43 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>> At least for SCSI devices _tag space_ is plenty; it's just that we
>> artificially limit our tag space to the queue depth to avoid having to
>> track that one separately. In addition we also preallocate a request
>> for each tag, so even if we would track the queue depth separately
>> we would waste a lot of memory.
>
> In practice it comes out to the same; it's not feasible to run a much
> larger tag space and track the queue depth on top of it. So I don't think
> that changes the conclusion for SCSI.

Agreed. Just had to check if my smart ass hat still fits this morning...

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-02 16:43 ` Christoph Hellwig
@ 2014-05-02 16:56 ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2014-05-02 16:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kent Overstreet, Ming Lei, Alexander Gordeev,
	Linux Kernel Mailing List, Shaohua Li, Nicholas Bellinger,
	Ingo Molnar, Peter Zijlstra

On 05/02/2014 10:43 AM, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>>> At least for SCSI devices _tag space_ is plenty; it's just that we
>>> artificially limit our tag space to the queue depth to avoid having to
>>> track that one separately. In addition we also preallocate a request
>>> for each tag, so even if we would track the queue depth separately
>>> we would waste a lot of memory.
>>
>> In practice it comes out to the same; it's not feasible to run a much
>> larger tag space and track the queue depth on top of it. So I don't think
>> that changes the conclusion for SCSI.
>
> Agreed. Just had to check if my smart ass hat still fits this morning...

If not, these are on sale:

    http://www.cafepress.com/+smartass_trucker_hat,18823771

:-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-03-26 13:34 [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
		   ` (2 preceding siblings ...)
  2014-04-22  7:10 ` [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags Alexander Gordeev
@ 2014-04-22 11:56 ` Peter Zijlstra
  2014-05-01 21:24 ` Alexander Gordeev
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-04-22 11:56 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> Hello,
>
> This series is against 3.14.0-rc7.
>
> It is aimed to further improve 'percpu_ida' tags locality by taking
> into account the system's CPU topology when stealing tags. That is,
> try to steal from a CPU which is 'closest' to the stealing one.
>
> I would not bother to post this, since on several systems the change
> did not show any improvement, i.e. on one such:

There is much lower-level fruit to be had before doing something like
this.

  https://lkml.org/lkml/2014/1/23/257
  https://lkml.org/lkml/2014/1/23/329
  https://lkml.org/lkml/2014/1/23/343
  https://lkml.org/lkml/2014/1/23/354

I've not had time to revisit/finish them, but you should definitely
clean up the percpu_ida stuff and reduce existing contention before
going overboard.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-04-22 11:56 ` Peter Zijlstra
@ 2014-05-01 21:24 ` Alexander Gordeev
  2014-05-01 22:04 ` Alexander Gordeev
  0 siblings, 1 reply; 26+ messages in thread
From: Alexander Gordeev @ 2014-05-01 21:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> I've not had time to revisit/finish them, but you should definitely
> clean up the percpu_ida stuff and reduce existing contention before
> going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.

I hope it looks like you intended, and that my comments in the
follow-up message make sense.

Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 	*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
 		if (remote == tags)
 			continue;
 
-		spin_lock(&remote->lock);
+		double_lock(&tags->lock, &remote->lock);
 
 		if (remote->nr_free) {
 			memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
 		}
 
 		spin_unlock(&remote->lock);
+		spin_unlock(&tags->lock);
 
 		if (tags->nr_free)
 			break;
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
+	int tag = -ENOSPC;
+
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock(&tags->lock);
+
+	return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
 	int tag = -ENOSPC;
 
+	local_irq_save(flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
 	spin_lock(&tags->lock);
 	if (tags->nr_free)
 		tag = tags->freelist[--tags->nr_free];
-	spin_unlock(&tags->lock);
+	spin_unlock_irqrestore(&tags->lock, flags);
 
 	return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
+	int tag = -ENOSPC;
+
+	spin_lock_irqsave(&pool->lock, flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	spin_unlock(&pool->lock);
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
+		spin_lock(&tags->lock);
+		if (tags->nr_free) {
+			tag = tags->freelist[--tags->nr_free];
+			if (tags->nr_free) {
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
+			}
+		}
+		spin_unlock(&tags->lock);
+	}
+
+	local_irq_restore(flags);
+
+	return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
-	struct percpu_ida_cpu *tags;
-	unsigned long flags;
-	int tag;
-
-	local_irq_save(flags);
-	tags = this_cpu_ptr(pool->tag_cpu);
-
-	/* Fastpath */
-	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
-
-	while (1) {
-		spin_lock(&pool->lock);
-
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		if (state != TASK_RUNNING)
-			prepare_to_wait(&pool->wait, &wait, state);
-
-		if (!tags->nr_free)
-			alloc_global_tags(pool, tags);
-		if (!tags->nr_free)
-			steal_tags(pool, tags);
-
-		if (tags->nr_free) {
-			tag = tags->freelist[--tags->nr_free];
-			if (tags->nr_free)
-				cpumask_set_cpu(smp_processor_id(),
-						&pool->cpus_have_tags);
-		}
-
-		spin_unlock(&pool->lock);
-		local_irq_restore(flags);
-
-		if (tag >= 0 || state == TASK_RUNNING)
-			break;
-
-		if (signal_pending_state(state, current)) {
-			tag = -ERESTARTSYS;
-			break;
-		}
-
-		schedule();
-
-		local_irq_save(flags);
-		tags = this_cpu_ptr(pool->tag_cpu);
-	}
-	if (state != TASK_RUNNING)
-		finish_wait(&pool->wait, &wait);
-
-	return tag;
+	int ret;
+
+	ret = alloc_local_tag(pool);
+	if (likely(ret >= 0))
+		return ret;
+
+	if (state != TASK_RUNNING) {
+		int tag;
+
+		ret = ___wait_event(pool->wait,
+				(tag = alloc_global_tag(pool)) >= 0,
+				state, 0, 0, schedule());
+
+		if (tag >= 0)
+			ret = tag;
+	} else {
+		ret = alloc_global_tag(pool);
+	}
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_alloc);
 
@@ -239,12 +253,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
 		wake_up(&pool->wait);
 	}
 
+	/*
+	 * No point in waking when 1 < n < max_size free, because steal_tags()
+	 * will assume max_size per set bit, having more free will not make it
+	 * more or less likely it will visit this cpu.
+	 */
+
 	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
 
 		/*
-		 * Global lock held and irqs disabled, don't need percpu
-		 * lock
+		 * Global lock held and irqs disabled, don't need percpu lock
+		 * because everybody accessing remote @tags will hold
+		 * pool->lock -- steal_tags().
		 */
 		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,
@@ -344,33 +365,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
 int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
 	void *data)
 {
-	unsigned long flags;
 	struct percpu_ida_cpu *remote;
-	unsigned cpu, i, err = 0;
+	unsigned long flags;
+	int cpu, i, err = 0;
 
-	local_irq_save(flags);
 	for_each_possible_cpu(cpu) {
 		remote = per_cpu_ptr(pool->tag_cpu, cpu);
-		spin_lock(&remote->lock);
+		spin_lock_irqsave(&remote->lock, flags);
 		for (i = 0; i < remote->nr_free; i++) {
 			err = fn(remote->freelist[i], data);
 			if (err)
 				break;
 		}
-		spin_unlock(&remote->lock);
+		spin_unlock_irqrestore(&remote->lock, flags);
 		if (err)
-			goto out;
+			return err;
 	}
 
-	spin_lock(&pool->lock);
+	spin_lock_irqsave(&pool->lock, flags);
 	for (i = 0; i < pool->nr_free; i++) {
 		err = fn(pool->freelist[i], data);
 		if (err)
 			break;
 	}
-	spin_unlock(&pool->lock);
-out:
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&pool->lock, flags);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);
-- 
1.7.7.6

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply related	[flat|nested] 26+ messages in thread
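The double_lock() helper in the combined patch above is worth a note: it is
the classic address-ordering idiom for taking two locks of the same class
without deadlocking, the same trick the scheduler's double_rq_lock() uses.
A condensed view of the hazard it prevents:

	/*
	 * Without a global order, two CPUs stealing from each other's
	 * percpu freelists could interleave as
	 *
	 *	CPU A: spin_lock(&a->lock);  spin_lock(&b->lock);  <- spins
	 *	CPU B: spin_lock(&b->lock);  spin_lock(&a->lock);  <- spins
	 *
	 * (a classic AB-BA deadlock).  Sorting the two locks by address
	 * before acquiring guarantees every path takes them in the same
	 * total order, and spin_lock_nested(l2, SINGLE_DEPTH_NESTING)
	 * tells lockdep that taking a second lock of the same class is
	 * intentional.
	 */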
* Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
  2014-05-01 21:24 ` Alexander Gordeev
@ 2014-05-01 22:04 ` Alexander Gordeev
  0 siblings, 0 replies; 26+ messages in thread
From: Alexander Gordeev @ 2014-05-01 22:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Kent Overstreet, Jens Axboe, Shaohua Li,
	Nicholas Bellinger, Ingo Molnar

On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
> On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> > I've not had time to revisit/finish them, but you should definitely
> > clean up the percpu_ida stuff and reduce existing contention before
> > going overboard.
>
> Hi Peter,
>
> I tried to combine your series into a single patch against 3.15-rc3.
>
> I hope it looks like you intended, and that my comments in the
> follow-up message make sense.
>
> Thanks!
>
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 93d145e..14f5151 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
>  	*dst_nr += nr;
>  }
>
> +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> +{
> +	if (l1 > l2)
> +		swap(l1, l2);
> +
> +	spin_lock(l1);
> +	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  /*
>   * Try to steal tags from a remote cpu's percpu freelist.
>   *
> @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		if (remote == tags)
>  			continue;
>
> -		spin_lock(&remote->lock);
> +		double_lock(&tags->lock, &remote->lock);
>
>  		if (remote->nr_free) {
>  			memcpy(tags->freelist,
> @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		}
>
>  		spin_unlock(&remote->lock);
> +		spin_unlock(&tags->lock);
>
>  		if (tags->nr_free)
>  			break;
>  	}
>  }
>
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -				     struct percpu_ida_cpu *tags)
> +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>  {
> -	move_tags(tags->freelist, &tags->nr_free,
> -		  pool->freelist, &pool->nr_free,
> -		  min(pool->nr_free, pool->percpu_batch_size));
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock(&tags->lock);
> +
> +	return tag;
>  }
>
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
>  	int tag = -ENOSPC;
>
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
>  	spin_lock(&tags->lock);
>  	if (tags->nr_free)
>  		tag = tags->freelist[--tags->nr_free];
> -	spin_unlock(&tags->lock);
> +	spin_unlock_irqrestore(&tags->lock, flags);
>
>  	return tag;
>  }
>
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag = -ENOSPC;
> +
> +	spin_lock_irqsave(&pool->lock, flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

	local_irq_save(flags);
	tags = this_cpu_ptr(pool->tag_cpu);
	spin_lock(&pool->lock);

> +
> +	if (!tags->nr_free) {
> +		/*
> +		 * Move tags from the global-, onto our percpu-freelist.
> +		 */
> +		move_tags(tags->freelist, &tags->nr_free,
> +			  pool->freelist, &pool->nr_free,
> +			  min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed locklessly here and races with steal_tags().
Would double_lock() help here?

> +	}
> +
> +	spin_unlock(&pool->lock);

It is quite difficult to comment here, since the affected code is not
shown, but I will try nevertheless.

steal_tags() used to be serialized by 'pool->lock', which is now removed.
As a result, two (or more) concurrent threads could find and unset the
very same bit in this piece of code:

	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
	     cpus_have_tags; cpus_have_tags--) {
		cpu = cpumask_next(cpu, &pool->cpus_have_tags);	// *** FIND

		if (cpu >= nr_cpu_ids) {
			cpu = cpumask_first(&pool->cpus_have_tags);
			if (cpu >= nr_cpu_ids)
				BUG();
		}

		pool->cpu_last_stolen = cpu;
		remote = per_cpu_ptr(pool->tag_cpu, cpu);

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);	// *** UNSET

		...
	}

The second thread's cpumask_clear_cpu() races with cpumask_set_cpu()
in percpu_ida_free():

	if (nr_free == 1) {
		cpumask_set_cpu(smp_processor_id(),
				&pool->cpus_have_tags);
		// here the bit is UNSET by the second concurrent thread
		wake_up(&pool->wait);
	}

As a result, the woken thread will not find the illegitimately unset
bit while looping over the mask in steal_tags(). IOW, the tag might get
"lost".

> +	if (!tags->nr_free)
> +		steal_tags(pool, tags);
> +
> +	if (tags->nr_free) {
> +		spin_lock(&tags->lock);
> +		if (tags->nr_free) {
> +			tag = tags->freelist[--tags->nr_free];
> +			if (tags->nr_free) {
> +				cpumask_set_cpu(smp_processor_id(),
> +						&pool->cpus_have_tags);
> +			}
> +		}
> +		spin_unlock(&tags->lock);
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return tag;
> +}
> +
>  /**
>   * percpu_ida_alloc - allocate a tag
>   * @pool: pool to allocate from
> @@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>   */
>  int percpu_ida_alloc(struct percpu_ida *pool, int state)
>  {
> -	DEFINE_WAIT(wait);
> -	struct percpu_ida_cpu *tags;
> -	unsigned long flags;
> -	int tag;
> -
> -	local_irq_save(flags);
> -	tags = this_cpu_ptr(pool->tag_cpu);
> -
> -	/* Fastpath */
> -	tag = alloc_local_tag(tags);
> -	if (likely(tag >= 0)) {
> -		local_irq_restore(flags);
> -		return tag;
> -	}
> -
> -	while (1) {
> -		spin_lock(&pool->lock);
> -
> -		/*
> -		 * prepare_to_wait() must come before steal_tags(), in case
> -		 * percpu_ida_free() on another cpu flips a bit in
> -		 * cpus_have_tags
> -		 *
> -		 * global lock held and irqs disabled, don't need percpu lock
> -		 */
> -		if (state != TASK_RUNNING)
> -			prepare_to_wait(&pool->wait, &wait, state);
> -
> -		if (!tags->nr_free)
> -			alloc_global_tags(pool, tags);
> -		if (!tags->nr_free)
> -			steal_tags(pool, tags);
> -
> -		if (tags->nr_free) {
> -			tag = tags->freelist[--tags->nr_free];
> -			if (tags->nr_free)
> -				cpumask_set_cpu(smp_processor_id(),
> -						&pool->cpus_have_tags);
> -		}
> -
> -		spin_unlock(&pool->lock);
> -		local_irq_restore(flags);
> -
> -		if (tag >= 0 || state == TASK_RUNNING)
> -			break;
> -
> -		if (signal_pending_state(state, current)) {
> -			tag = -ERESTARTSYS;
> -			break;
> -		}
> -
> -		schedule();
> -
> -		local_irq_save(flags);
> -		tags = this_cpu_ptr(pool->tag_cpu);
> -	}
> -	if (state != TASK_RUNNING)
> -		finish_wait(&pool->wait, &wait);
> -
> -	return tag;
> +	int ret;
> +
> +	ret = alloc_local_tag(pool);
> +	if (likely(ret >= 0))
> +		return ret;
> +
> +	if (state != TASK_RUNNING) {
> +		int tag;
> +
> +		ret = ___wait_event(pool->wait,
> +				(tag = alloc_global_tag(pool)) >= 0,
> +				state, 0, 0, schedule());
> +
> +		if (tag >= 0)
> +			ret = tag;
> +	} else {
> +		ret = alloc_global_tag(pool);
> +	}
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_alloc);
>
> @@ -239,12 +253,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
>  		wake_up(&pool->wait);
>  	}
>
> +	/*
> +	 * No point in waking when 1 < n < max_size free, because steal_tags()
> +	 * will assume max_size per set bit, having more free will not make it
> +	 * more or less likely it will visit this cpu.
> +	 */
> +
>  	if (nr_free == pool->percpu_max_size) {
>  		spin_lock(&pool->lock);
>
>  		/*
> -		 * Global lock held and irqs disabled, don't need percpu
> -		 * lock
> +		 * Global lock held and irqs disabled, don't need percpu lock
> +		 * because everybody accessing remote @tags will hold
> +		 * pool->lock -- steal_tags().
>  		 */
>  		if (tags->nr_free == pool->percpu_max_size) {
>  			move_tags(pool->freelist, &pool->nr_free,
> @@ -344,33 +365,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
>  int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
>  	void *data)
>  {
> -	unsigned long flags;
>  	struct percpu_ida_cpu *remote;
> -	unsigned cpu, i, err = 0;
> +	unsigned long flags;
> +	int cpu, i, err = 0;
>
> -	local_irq_save(flags);
>  	for_each_possible_cpu(cpu) {
>  		remote = per_cpu_ptr(pool->tag_cpu, cpu);
> -		spin_lock(&remote->lock);
> +		spin_lock_irqsave(&remote->lock, flags);
>  		for (i = 0; i < remote->nr_free; i++) {
>  			err = fn(remote->freelist[i], data);
>  			if (err)
>  				break;
>  		}
> -		spin_unlock(&remote->lock);
> +		spin_unlock_irqrestore(&remote->lock, flags);
>  		if (err)
> -			goto out;
> +			return err;
>  	}
>
> -	spin_lock(&pool->lock);
> +	spin_lock_irqsave(&pool->lock, flags);
>  	for (i = 0; i < pool->nr_free; i++) {
>  		err = fn(pool->freelist[i], data);
>  		if (err)
>  			break;
>  	}
> -	spin_unlock(&pool->lock);
> -out:
> -	local_irq_restore(flags);
> +	spin_unlock_irqrestore(&pool->lock, flags);
> +
>  	return err;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
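The race Alexander describes can be condensed into a timeline. This is a
hypothetical trace under the assumption that steal_tags() no longer runs
under pool->lock, with two stealers S1/S2 and a CPU F freeing a tag:

	/*
	 *	S1: cpumask_next() finds CPU X set in pool->cpus_have_tags
	 *	S2: cpumask_next() finds the same CPU X (nothing excludes it)
	 *	S1: cpumask_clear_cpu(X); drains X's freelist under remote->lock
	 *	 F: percpu_ida_free() on X: nr_free becomes 1,
	 *	    cpumask_set_cpu(X), wake_up(&pool->wait)
	 *	S2: cpumask_clear_cpu(X)   <- wipes out the bit F just set
	 *
	 * A waiter woken by F now scans cpus_have_tags, does not see X, and
	 * goes back to sleep even though X holds a free tag; the tag stays
	 * "lost" until a later free on X sets the bit again.
	 */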