__assign_irq_vector (x86) and irq vectors exhaust

netdev.vger.kernel.org archive mirror

From: <p_kosyh@factor-ts.ru>
To: netdev@vger.kernel.org
Subject: __assign_irq_vector (x86) and irq vectors exhaust
Date: Mon, 18 May 2015 15:48:33 +0300
Message-ID: <adn_166406_0_@factor-ts.ru>
Message-ID: <20150518124833.GA7512@peter-bsd.cuba.int>

[-- Attachment #1: Type: text/plain, Size: 2230 bytes --]

Hello all!

While experimenting with 10Gb network adapters and /proc/irq/<nr>/smp_affinity,
we found that sometimes we cannot move an interrupt to the selected cpu.

After digging through the source code, we found that
arch/x86/kernel/apic/vector.c: __assign_irq_vector (4.0 kernel)
allocates vectors in a suboptimal way.

For example, we have a 32-cpu system with a lot of 10Gb cards (each of
them has 32 MSI-X irqs). Even if a card is not used, it allocates irq
vectors after probing (pci_enable_msix()). We have a limit of about 200 vectors
per cpu (on x86), and __assign_irq_vector allocates them by filling cpus one
by one (see cpumask_first_and()):

	...

	cpumask_clear(cfg->old_domain);
	cpu = cpumask_first_and(mask, cpu_online_mask);
	/* here we get the first non-zero bit <----------- */
	while (cpu < nr_cpu_ids) {
		int new_cpu, vector, offset;

		apic->vector_allocation_domain(cpu, tmp_mask, mask);

		...

		if (unlikely(current_vector == vector)) {
			cpumask_or(cfg->old_domain, cfg->old_domain, tmp_mask);
			cpumask_andnot(tmp_mask, mask, cfg->old_domain);
			cpu = cpumask_first_and(tmp_mask, cpu_online_mask);
			/* get next non zero bit <------------ */
			continue;
		}

		...

So, after our system is up, we have a situation where some cpus
have no free vectors at all, while other cpus have all their vectors free.

Userspace knows nothing about this exhaustion! So after writing a
mask to smp_affinity we can end up in a situation where the irq cannot be moved.
Silently.
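
To make the effect concrete, here is a small userspace model of the two
policies. It is only a sketch under loud assumptions: simulate() and
first_free_cpu() are hypothetical names, the model ignores vector allocation
domains and reserved vectors, and the "spread" policy is a simple rotation,
not what the kernel does.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS         32   /* cpu count from the example above */
#define VECTORS_PER_CPU 200  /* rough per-cpu limit quoted above */

/* First cpu at or after 'start' (wrapping around) with a free vector, or -1. */
static int first_free_cpu(const int used[NR_CPUS], int start)
{
	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (start + i) % NR_CPUS;

		if (used[cpu] < VECTORS_PER_CPU)
			return cpu;
	}
	return -1;
}

/*
 * Place nr_irqs vectors one at a time and return how many cpus end up full.
 * spread == 0: always search from cpu 0 (models cpumask_first_and()).
 * spread != 0: rotate the starting cpu per irq (hypothetical spreading policy).
 */
static int simulate(int nr_irqs, int spread, int used[NR_CPUS])
{
	int full = 0;

	assert(nr_irqs >= 0);
	memset(used, 0, NR_CPUS * sizeof(used[0]));

	for (int irq = 0; irq < nr_irqs; irq++) {
		int cpu = first_free_cpu(used, spread ? irq % NR_CPUS : 0);

		if (cpu < 0)
			break;	/* every cpu is exhausted */
		used[cpu]++;
	}

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (used[cpu] == VECTORS_PER_CPU)
			full++;
	return full;
}
```

With an assumed 20 cards of 32 irqs each (640 irqs), simulate(640, 0, used)
leaves cpus 0-2 completely full, 40 vectors on cpu 3 and nothing on the
remaining cpus, while simulate(640, 1, used) ends with 20 vectors on every cpu.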

It is not a critical thing when you are doing everything by
hand, but if we are using an irq balancer, like birq (http://birq.libcode.org)
or any other, this problem becomes critical! The balancer has no idea why the irq
has still not moved. Btw, the other problem is napi and softirq sticking
(http://comments.gmane.org/gmane.linux.network/322914), but I have
already written about that problem and a possible solution.

Anyway, it looks like a bad idea to fill cpus one after another instead of
spreading irq vectors across them.

The solution is simple. Instead of using cpumask_first_and(), try to pick a
RANDOM bit. I wrote a quick-and-dirty implementation that works for me. Of
course, it should be done properly, but I have attached the patch (made
against 3.10.65) as an illustration.
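
For anyone who wants to play with the selection step outside the kernel, here
is a userspace sketch of the same idea on plain 64-bit masks. The names
popcount64() and pick_common_bit() are hypothetical, and the caller supplies
the selector k (a random number, a counter, or anything else); this is an
illustration only, not the kernel cpumask API.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a 64-bit mask. */
static int popcount64(uint64_t m)
{
	int n = 0;

	for (; m; m &= m - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * Return the index of the (k mod n)-th set bit of (a & b), counting from
 * bit 0, where n is the number of bits set in both masks.
 * Returns -1 when the masks have no bit in common.
 */
static int pick_common_bit(uint64_t a, uint64_t b, unsigned int k)
{
	uint64_t both = a & b;
	int n = popcount64(both);
	int idx = 0;

	if (n == 0)
		return -1;

	k %= (unsigned int)n;
	while (k--)
		both &= both - 1;	/* skip the k lowest common bits */
	assert(both != 0);		/* k < n, so at least one bit remains */

	while (!(both & 1)) {		/* index of the lowest remaining bit */
		both >>= 1;
		idx++;
	}
	return idx;
}
```

With k == 0 this behaves like a first-fit search over the common bits; feeding
it a changing k spreads the picks over all of them.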

I hope it helps someone else.

Thank you!

-- 
Peter Kosyh

[-- Attachment #2: linux-3.10-irq-rr.patch --]
[-- Type: text/x-diff, Size: 1914 bytes --]

diff -Nur linux-3.10.65/arch/x86/kernel/apic/io_apic.c linux-3.10.65-irq-rr/arch/x86/kernel/apic/io_apic.c
--- linux-3.10.65/arch/x86/kernel/apic/io_apic.c	2015-01-16 18:00:00.000000000 +0300
+++ linux-3.10.65-irq-rr/arch/x86/kernel/apic/io_apic.c	2015-05-14 10:29:45.618572555 +0300
@@ -1072,6 +1072,7 @@
 {
 	raw_spin_unlock(&vector_lock);
 }
+extern int cpumask_any_and_real(const struct cpumask *src1p, const struct cpumask *src2p);

 static int
 __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
@@ -1101,7 +1102,7 @@
 	/* Only try and allocate irqs on cpus that are present */
 	err = -ENOSPC;
 	cpumask_clear(cfg->old_domain);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = cpumask_any_and_real(mask, cpu_online_mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;

@@ -1135,7 +1136,7 @@
 		if (unlikely(current_vector == vector)) {
 			cpumask_or(cfg->old_domain, cfg->old_domain, tmp_mask);
 			cpumask_andnot(tmp_mask, mask, cfg->old_domain);
-			cpu = cpumask_first_and(tmp_mask, cpu_online_mask);
+			cpu = cpumask_any_and_real(tmp_mask, cpu_online_mask);
 			continue;
 		}

diff -Nur linux-3.10.65/lib/cpumask.c linux-3.10.65-irq-rr/lib/cpumask.c
--- linux-3.10.65/lib/cpumask.c	2015-01-16 18:00:00.000000000 +0300
+++ linux-3.10.65-irq-rr/lib/cpumask.c	2015-05-14 10:28:28.410574546 +0300
@@ -164,3 +164,31 @@
 	free_bootmem(__pa(mask), cpumask_size());
 }
 #endif
+
+int cpumask_any_and_real(const struct cpumask *src1p,
+		const struct cpumask *src2p)
+{
+	int i;
+	int n = 0;
+	static int seed = 0;
+
+	for_each_cpu_and(i, src1p, src2p) { /* total number of cpus */
+		n ++;
+	}
+
+	if (!n) /* no cpus */
+		return nr_cpu_ids;
+
+	n = (seed ^ jiffies) % n;
+
+	seed ++;
+
+	for_each_cpu_and(i, src1p, src2p) { /* walk to the n-th common cpu */
+		if (!n)
+			return i;
+		n --;
+	}
+	return nr_cpu_ids;
+}
+
+EXPORT_SYMBOL(cpumask_any_and_real);
