* __assign_irq_vector (x86) and irq vectors exhaust
[not found] <20150518124833.GA7512@peter-bsd.cuba.int>
2015-05-18 12:48 ` __assign_irq_vector (x86) and irq vectors exhaust p_kosyh
@ 2015-05-18 12:48 ` p_kosyh
1 sibling, 0 replies; 5+ messages in thread
From: p_kosyh @ 2015-05-18 12:48 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 2230 bytes --]
Hello all!
Playing with 10Gb network adapters and /proc/irq/<nr>/smp_affinity
we found that sometimes we cannot move an interrupt to a selected cpu.
After digging through the source code we found that
arch/x86/kernel/apic/vector.c: __assign_irq_vector (4.0 kernel)
allocates vectors in a suboptimal way.
For example, we have a 32-cpu system with a lot of 10Gb cards (each of
them has 32 msi-x irqs). Even if a card is not used, it allocates irq
vectors at probe time (pci_enable_msix()). There is a limit of roughly
200 vectors per cpu (on x86), and __assign_irq_vector fills cpus one
by one (see the cpumask_first_and() calls):
	...
	cpumask_clear(cfg->old_domain);
	cpu = cpumask_first_and(mask, cpu_online_mask);
	/* here we get the 1st non-zero bit <----------- */
	while (cpu < nr_cpu_ids) {
		int new_cpu, vector, offset;

		apic->vector_allocation_domain(cpu, tmp_mask, mask);
		...
		if (unlikely(current_vector == vector)) {
			cpumask_or(cfg->old_domain, cfg->old_domain,
				   tmp_mask);
			cpumask_andnot(tmp_mask, mask, cfg->old_domain);
			cpu = cpumask_first_and(tmp_mask,
						cpu_online_mask);
			/* get the next non-zero bit <------------ */
			continue;
		}
	...
So, after the system is up, some cpus have no free vectors at all,
while other cpus have all their vectors free. Userspace knows nothing
about this exhaustion, so after writing a mask to smp_affinity we can
end up with an irq that cannot be moved.
Silently.
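The skew is easy to reproduce with a tiny userspace model (an illustration only, not kernel code; the numbers just mirror the 32-cpu / ~200-vector example above): every allocation goes to the lowest-numbered cpu that still has room, exactly as cpumask_first_and() keeps returning the lowest set bit.

```c
#include <assert.h>

#define NCPUS        32
#define PER_CPU_MAX  200   /* rough per-cpu vector limit on x86 */

/* One allocation under first-fit: take the lowest-numbered cpu that
 * still has a free vector, as cpumask_first_and() effectively does.
 * Returns the chosen cpu, or -1 when every cpu is exhausted. */
static int alloc_first_fit(int used[NCPUS])
{
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		if (used[cpu] < PER_CPU_MAX) {
			used[cpu]++;
			return cpu;
		}
	}
	return -1;
}
```

After 400 allocations, cpu0 and cpu1 are completely full while cpu2..cpu31 hold nothing: exactly the "some cpus have no free vectors, some have all free" picture.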
It is not a critical thing when you do everything by hand, but if you
use an irq balancer like birq (http://birq.libcode.org) or any other,
the problem becomes critical: the balancer has no idea why the irq
still has not moved. Btw, another problem is napi and softirq sticking
(http://comments.gmane.org/gmane.linux.network/322914), but I have
already written about that problem and a possible solution.
Anyway, it looks like a bad idea to fill cpus one after another instead
of spreading irq vectors across them.
The solution is simple: instead of using cpumask_first_and(), try to
pick a RANDOM set bit. I wrote a quick-and-dirty implementation that
works for me. Of course it should be done properly; the attached patch
is just an illustration.
Hope it helps someone else.
Thank you!
--
Peter Kosyh
[-- Attachment #2: linux-3.10-irq-rr.patch --]
[-- Type: text/x-diff, Size: 1914 bytes --]
diff -Nur linux-3.10.65/arch/x86/kernel/apic/io_apic.c linux-3.10.65-irq-rr/arch/x86/kernel/apic/io_apic.c
--- linux-3.10.65/arch/x86/kernel/apic/io_apic.c 2015-01-16 18:00:00.000000000 +0300
+++ linux-3.10.65-irq-rr/arch/x86/kernel/apic/io_apic.c 2015-05-14 10:29:45.618572555 +0300
@@ -1072,6 +1072,7 @@
 {
 	raw_spin_unlock(&vector_lock);
 }
+extern int cpumask_any_and_real(const struct cpumask *src1p, const struct cpumask *src2p);
 
 static int
 __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
@@ -1101,7 +1102,7 @@
 	/* Only try and allocate irqs on cpus that are present */
 	err = -ENOSPC;
 	cpumask_clear(cfg->old_domain);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = cpumask_any_and_real(mask, cpu_online_mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;
 
@@ -1135,7 +1136,7 @@
 		if (unlikely(current_vector == vector)) {
 			cpumask_or(cfg->old_domain, cfg->old_domain, tmp_mask);
 			cpumask_andnot(tmp_mask, mask, cfg->old_domain);
-			cpu = cpumask_first_and(tmp_mask, cpu_online_mask);
+			cpu = cpumask_any_and_real(tmp_mask, cpu_online_mask);
 			continue;
 		}
diff -Nur linux-3.10.65/lib/cpumask.c linux-3.10.65-irq-rr/lib/cpumask.c
--- linux-3.10.65/lib/cpumask.c 2015-01-16 18:00:00.000000000 +0300
+++ linux-3.10.65-irq-rr/lib/cpumask.c 2015-05-14 10:28:28.410574546 +0300
@@ -164,3 +164,31 @@
 	free_bootmem(__pa(mask), cpumask_size());
 }
 #endif
+
+int cpumask_any_and_real(const struct cpumask *src1p,
+			 const struct cpumask *src2p)
+{
+	int i;
+	int n = 0;
+	static int seed = 0;
+
+	for_each_cpu_and(i, src1p, src2p) { /* count cpus in both masks */
+		n++;
+	}
+
+	if (!n) /* no cpus */
+		return nr_cpu_ids;
+
+	n = (seed ^ jiffies) % n;
+
+	seed++;
+
+	for_each_cpu_and(i, src1p, src2p) { /* walk to the n-th set bit */
+		if (!n)
+			return i;
+		n--;
+	}
+	return nr_cpu_ids;
+}
+
+EXPORT_SYMBOL(cpumask_any_and_real);
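The core of the patch is easier to see outside the kernel. Here is a userspace sketch using plain 64-bit masks instead of struct cpumask (the function name and types are mine, for illustration only; `rnd` stands in for the patch's `(seed ^ jiffies)` pseudo-random value):

```c
#include <stdint.h>

/* Pick the rnd-th (mod population count) set bit of src1 & src2,
 * instead of always the lowest one. Returns -1 for an empty
 * intersection, where the kernel version returns nr_cpu_ids. */
static int mask_any_and(uint64_t src1, uint64_t src2, unsigned int rnd)
{
	uint64_t both = src1 & src2;
	int i, n = 0;

	for (i = 0; i < 64; i++)	/* count candidate cpus */
		if (both & (1ULL << i))
			n++;

	if (!n)				/* no cpu is in both masks */
		return -1;

	n = rnd % n;			/* index of the bit to pick */

	for (i = 0; i < 64; i++)	/* walk to the n-th set bit */
		if (both & (1ULL << i)) {
			if (!n)
				return i;
			n--;
		}
	return -1;			/* not reached */
}
```

With a varying `rnd`, successive allocations land on different set bits, which is what spreads the vectors across cpus.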
* __assign_irq_vector (x86) and irq vectors exhaust
@ 2015-05-18 12:57 p.kosyh
2015-05-18 13:40 ` David Laight
0 siblings, 1 reply; 5+ messages in thread
From: p.kosyh @ 2015-05-18 12:57 UTC (permalink / raw)
To: netdev
* RE: __assign_irq_vector (x86) and irq vectors exhaust
2015-05-18 12:57 p.kosyh
@ 2015-05-18 13:40 ` David Laight
2015-05-18 13:53 ` 'p.kosyh@gmail.com'
0 siblings, 1 reply; 5+ messages in thread
From: David Laight @ 2015-05-18 13:40 UTC (permalink / raw)
To: 'p.kosyh@gmail.com', netdev@vger.kernel.org
From: p.kosyh@gmail.com
> Sent: 18 May 2015 13:58
> Playing with 10Gb network adapters and /proc/irq/<nr>/smp_affinity
> we found that sometimes we can not move interrupt on selected cpu.
>
> After digging source code we found, that
> arch/x86/kernel/apic/vector.c: __assign_irq_vector (4.0 kernel)
> allocates vectors in not optimal way.
>
> For example, we have a 32 cpu system with lot of 10Gb cards (each of
> them has 32 msi-x irqs). Even if card is not used, it allocates an irq
> vector after probing (pci_enable_msix()). We have about ~200 vectors limit
> per cpu (on x86), and __assign_irq_vector allocates them filling cpus one
> by one (see at cpumask_first_and()):
...
It might help if the kernel APIs allowed a driver to request additional
MSI-X interrupts after probe time.
If a device supports 32 interrupts the driver can say that it only
needs (say) interrupts 0, 1 and 16 (and only these MSIX table slots
get filled with interrupt 'info') - but can't later allocate the
MSIX info for other interrupts.
I can't see anything in the MSIX spec that stops things working
that way.
David
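The constraint David describes can be modeled in userspace (this is not a real kernel or PCI API, just a sketch of the semantics with invented names): slots can be populated while the device is being probed, but nothing lets the driver populate more of them afterwards.

```c
#define MSIX_SLOTS 32

struct msix_model {
	int filled[MSIX_SLOTS];	/* 1 if the slot holds interrupt info */
	int probe_done;		/* set once probe-time setup has finished */
};

/* Fill one MSI-X table slot; mirrors today's semantics, where the set
 * of vectors is fixed at probe time. Returns 0 on success, -1 if the
 * slot is out of range or probe has already finished. */
static int msix_model_request(struct msix_model *m, int slot)
{
	if (slot < 0 || slot >= MSIX_SLOTS)
		return -1;
	if (m->probe_done)	/* no post-probe allocation today */
		return -1;
	m->filled[slot] = 1;
	return 0;
}
```

Under these semantics a driver can take slots 0, 1 and 16 at probe time, but a later request for slot 2 fails, which is the limitation the suggested API change would lift.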
* Re: __assign_irq_vector (x86) and irq vectors exhaust
2015-05-18 13:40 ` David Laight
@ 2015-05-18 13:53 ` 'p.kosyh@gmail.com'
0 siblings, 0 replies; 5+ messages in thread
From: 'p.kosyh@gmail.com' @ 2015-05-18 13:53 UTC (permalink / raw)
To: David Laight; +Cc: netdev@vger.kernel.org
Yes, the allocation of vectors during probing is not the problem. The
problem is described in the bottom part of my message: we allocate
cpus one after another. One by one.
For example, say we have 200 irqs and the irq domain is 0xffffffff (no
numa, or 1 numa node and 32 cpus). While probing devices, cpu0 will get
all 200 irq slots.
A better solution is to fill all cpus randomly (or round robin).
Sorry for my ugly english.
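The round-robin variant can be sketched the same way in userspace (again a plain 64-bit mask and an invented name, not kernel code): a static cursor advances on every call, so successive allocations rotate through the set bits instead of piling onto bit 0.

```c
#include <stdint.h>

/* Return the next set bit of the mask in rotation, or -1 if the mask
 * is empty. The static cursor keeps round-robin state across calls. */
static int mask_round_robin(uint64_t mask)
{
	static unsigned int cursor;
	int bits[64];
	int i, n = 0;

	for (i = 0; i < 64; i++)
		if (mask & (1ULL << i))
			bits[n++] = i;		/* collect candidate cpus */

	if (!n)
		return -1;

	return bits[cursor++ % n];		/* next cpu in rotation */
}
```

In a real patch the cursor would need locking (or a per-cpu/atomic counter), since __assign_irq_vector runs under vector_lock anyway.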
> > For example, we have a 32 cpu system with lot of 10Gb cards (each of
> > them has 32 msi-x irqs). Even if card is not used, it allocates an irq
> > vector after probing (pci_enable_msix()). We have about ~200 vectors limit
> > per cpu (on x86), and __assign_irq_vector allocates them filling cpus one
> > by one (see at cpumask_first_and()):
> ...
>
> It might help if the kernel APIs allowed a driver to request additional
> MSI-X interrupts after probe time.
>
> If a device supports 32 interrupts the driver can say that it only
> needs (say) interrupts 0, 1 and 16 (and only these MSIX table slots
> get filled with interrupt 'info') - but can't later allocate the
> MSIX info for other interrupts.
>
> I can't see anything in the MSIX spec that stops things working
> that way.
>
> David
>
--
Peter Kosyh